Coasters
From Java CoG Kit
This page is an informal description and a place for persistent discussions about coasters. You need an account to edit it. Send an email to either hategan@mcs.anl.gov or gregor@<the same place> if you need an account.
Description
Coasters are an "alternative" implementation of the concept of Condor Glide-ins. They are meant to be simple to use (to the extent the problem permits), self-deployable, and portable. Their main purpose is to allow efficient resource utilization by dynamically grouping multiple jobs into single resource manager requests, thereby reducing the amount of time jobs spend in queues.
In general, queuing systems do not support scheduling based on fine-grained time requests. Priorities are instead managed using coarse-grained time requirements, for example "jobs shorter than 1h can be submitted to the 'fast' queue". This leaves little room for, say, 20-minute jobs. Coasters allow three such 20-minute jobs to be treated as a single 1h job without the need to statically specify more than a maximum wall time.
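The arithmetic behind this grouping can be illustrated with a small sketch. It is not the allocation logic Coasters actually use (that logic is dynamic and is not described on this page). It is a plain first-fit packing written for illustration, and the names `Block` and `pack_jobs` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One resource manager request that runs several short jobs back to back."""
    max_walltime_min: int
    jobs: list = field(default_factory=list)

    def used(self) -> int:
        # Total minutes already claimed by jobs in this block.
        return sum(self.jobs)

    def fits(self, job_min: int) -> bool:
        return self.used() + job_min <= self.max_walltime_min


def pack_jobs(job_minutes, max_walltime_min):
    """First-fit packing: place each job in the first block with enough time left."""
    blocks = []
    for job in job_minutes:
        if job > max_walltime_min:
            raise ValueError(f"job of {job} min exceeds the maximum wall time")
        for block in blocks:
            if block.fits(job):
                block.jobs.append(job)
                break
        else:
            # No existing block has room, so a new request is needed.
            block = Block(max_walltime_min)
            block.jobs.append(job)
            blocks.append(block)
    return blocks


# Three 20-minute jobs share a single 1h request.
blocks = pack_jobs([20, 20, 20], max_walltime_min=60)
print(len(blocks), blocks[0].jobs)  # prints: 1 [20, 20, 20]
```

A fourth 20-minute job would not fit into the same hour and would open a second block in this sketch.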
An additional benefit is speed. Coasters use an efficient communication library which allows the multiplexing of multiple requests through a single secure TCP connection, which means that authentication (an expensive operation) can be shared by multiple job requests.
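As a rough illustration of what multiplexing over one connection means, the sketch below tags every message with a request identifier so that several job requests can share one byte stream. The frame layout (a 4-byte tag followed by a 4-byte length) is an assumption made for this example and is not the wire format of the Coasters communication library.

```python
import struct

# Hypothetical frame header: request tag and payload length, both unsigned 32-bit.
HEADER = struct.Struct("!II")


def encode_frame(tag: int, payload: bytes) -> bytes:
    """Prefix a payload with its request tag and length."""
    return HEADER.pack(tag, len(payload)) + payload


def decode_frames(stream: bytes):
    """Yield (tag, payload) pairs from a byte string of concatenated frames."""
    offset = 0
    while offset < len(stream):
        tag, length = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        yield tag, stream[offset:offset + length]
        offset += length


# Messages from two job requests interleaved on one (already authenticated) stream.
wire = (
    encode_frame(1, b"submit job A")
    + encode_frame(2, b"submit job B")
    + encode_frame(1, b"status?")
)

# The receiving side sorts the frames back into per-request conversations.
by_request = {}
for tag, payload in decode_frames(wire):
    by_request.setdefault(tag, []).append(payload)
print(by_request)
```

Because every frame carries its own tag, the expensive authentication handshake happens once per connection and not once per job request.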
Basic Flow
The following terms are used:
- Submit Host - the machine from which the job is submitted (i.e. "your machine")
- Remote Host - the host that the job is submitted to (typically a cluster)
- Boot Handler - the first part of the coaster job manager string (e.g. the first gt2 in gt2:gt2:PBS)
- Remote Handler - the second part of the coaster job manager string (e.g. the second gt2 in gt2:gt2:PBS)
- Remote Job Manager - the third (and optional) part of the coaster job manager string (e.g. PBS in gt2:gt2:PBS)
- Local Resource Manager (LRM) - a queuing system on the Remote Host (e.g. PBS, SGE)
- Remote Resource Manager (RRM) - a system used to submit jobs to a Remote Host (e.g. GRAM)
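The three parts of the coaster job manager string can be pulled apart as in the following sketch. The function `parse_job_manager` and the `JobManagerSpec` type are hypothetical names used only for illustration, and the real provider may accept forms this sketch rejects.

```python
from typing import NamedTuple, Optional


class JobManagerSpec(NamedTuple):
    boot_handler: str                  # used to start the Coaster Service
    remote_handler: str                # used to submit Coaster Worker jobs
    remote_job_manager: Optional[str]  # optional third part, e.g. an LRM name


def parse_job_manager(spec: str) -> JobManagerSpec:
    """Split a string such as 'gt2:gt2:PBS' into its named parts."""
    parts = spec.split(":")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"expected 'boot:remote[:jobmanager]', got {spec!r}")
    manager = parts[2] if len(parts) == 3 else None
    return JobManagerSpec(parts[0], parts[1], manager)


print(parse_job_manager("gt2:gt2:PBS"))
```

With the third part omitted, as in the assumed form `gt2:gt2`, `remote_job_manager` is `None` in this sketch.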
Here's a description of what happens when a job is submitted using the coaster provider:
- If not already started, a simple HTTP service, called the Bootstrap Service, is started on the Submit Host.
- If not already started, a Messaging Service is started on the Submit Host.
- If not already started, a Coaster Service is started on the Remote Host. Starting up the Coaster Service involves the following steps:
  - A Bootstrap Script is submitted to the Remote Host using the Boot Handler (typically through the fork job manager).
  - The Bootstrap Script downloads and verifies a Bootstrap Application (a jar file) from the Bootstrap Service and then starts it.
  - The Bootstrap Application fetches a list of files that compose the Coaster Service from the Bootstrap Service, downloads and caches them, and launches the Coaster Service on the Remote Host.
  - The Coaster Service opens a persistent Message Exchange Link to the Messaging Service on the Submit Host.
- The job description is submitted through the Message Exchange Link to the Coaster Service.
- The Coaster Service submits, if necessary, a Coaster Worker job to the Remote Host using the Remote Handler and the Remote Job Manager. This job can be submitted either to the RRM or directly to the LRM.
- After starting, the Coaster Worker opens a Secondary Message Exchange Link to the Coaster Service.
- The job description is submitted to the Coaster Worker through the Secondary Message Exchange Link.
- The Coaster Worker runs the job through fork/exec.
- When the job completes, the Coaster Worker notifies the Coaster Service through the Secondary Message Exchange Link. The Coaster Service in turn notifies the client on the Submit Host through the Message Exchange Link.
- Subsequent jobs are sent directly to a Coaster Worker through the already established Message Exchange Links.
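The flow above can be modelled as a toy, in-process simulation. This is only a sketch: the two Message Exchange Links are replaced by direct method calls, the bootstrap steps are collapsed into object construction, the job is a Python callable and not a forked process, and all class and attribute names are hypothetical.

```python
class CoasterWorker:
    def __init__(self, service):
        # The reference to the service stands in for the
        # Secondary Message Exchange Link.
        self.service = service

    def run(self, job_id, command):
        result = command()  # a real worker would fork/exec here
        self.service.job_done(job_id, result)


class CoasterService:
    def __init__(self, client):
        # The reference to the client stands in for the
        # primary Message Exchange Link.
        self.client = client
        self.worker = None
        self.workers_started = 0

    def submit(self, job_id, command):
        if self.worker is None:  # start a worker only if necessary
            self.worker = CoasterWorker(self)
            self.workers_started += 1
        self.worker.run(job_id, command)

    def job_done(self, job_id, result):
        # Relay the completion notification back to the client.
        self.client.job_done(job_id, result)


class Client:
    def __init__(self):
        self.service = None
        self.services_started = 0
        self.completed = {}

    def submit(self, job_id, command):
        if self.service is None:  # "if not already started"
            self.service = CoasterService(self)
            self.services_started += 1
        self.service.submit(job_id, command)

    def job_done(self, job_id, result):
        self.completed[job_id] = result


client = Client()
client.submit("job-1", lambda: 0)
client.submit("job-2", lambda: 0)  # reuses the existing service and worker
print(client.services_started, client.service.workers_started, client.completed)
```

The second submission skips every startup step, which is where the time savings for later jobs come from.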

Notes
- The primary Message Exchange Link is typically a secure connection that uses GSI Sockets. It can be used for concurrent submissions of multiple jobs.
- The Secondary Message Exchange Link is typically unsecured (plain TCP) and is, in theory, routed only inside the cluster.
- This scheme assumes that the Coaster Worker can connect to the Coaster Service through TCP.
- This scheme also assumes that the Coaster Service can connect to the Messaging Service on the Submit Host.
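The two connectivity assumptions can be checked ahead of time with a generic TCP probe such as the sketch below. It is not part of Coasters. The helper `can_connect` is a hypothetical name, and the host and port to test depend on the local setup.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run from a worker node, a call such as `can_connect(service_host, service_port)` tests the first assumption. Run from the Remote Host against the Submit Host, it tests the second. Both argument names are placeholders.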

