<?php  
            include_once( $_SERVER['DOCUMENT_ROOT']."/static/includes/common.inc.php" );
            do_html_header("Documentation");
        ?><div id="content">
<div class="navheader">
<table width="100%" summary="Navigation header"><tr>
<td width="20%" align="left">
<a accesskey="p" href="creating_workflows.php">Prev</a> </td>
<td width="60%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="20%" align="right"> <a accesskey="n" href="execution_environments.php">Next</a>
</td>
</tr></table>
<hr>
</div>
<div class="chapter" title="Chapter 5. Running Workflows">
<div class="titlepage"><div><div><h2 class="title">
<a name="running_workflows"></a>Chapter 5. Running Workflows</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="running_workflows.php#executable_workflows">5.1. Executable Workflows (DAG)</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#mapping_refinement_steps">5.2. Mapping Refinement Steps</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#data_staging_configuration">5.3. Data Staging Configuration</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#pegasuslite">5.4. PegasusLite</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#pegasus-plan">5.5. Pegasus-Plan</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#BasicProperties">5.6. Basic Properties</a></span></dt>
</dl></div>
<div class="section" title="5.1. Executable Workflows (DAG)">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="executable_workflows"></a>5.1. Executable Workflows (DAG)</h2></div></div></div>
<p>The DAG is an executable (concrete) workflow that can be executed
    over a variety of resources. When the workflow tasks are mapped to
    multiple resources that do not share a file system, explicit nodes are
    added to the workflow to orchestrate data transfer between the
    tasks.</p>
<p>When you take the DAX workflow created in <a class="link" href="creating_workflows.php" title="Chapter 4. Creating Workflows">Creating Workflows</a> and plan it for
    execution on a single remote grid site, here a site with handle <span class="bold"><strong>hpcc</strong></span>, without clean-up nodes,
    the following concrete workflow is built:</p>
<div class="figure">
<a name="concepts-fig-dag"></a><p class="title"><b>Figure 5.1. Black Diamond DAG</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0"><tr><td align="center" valign="middle"><img src="images/concepts-diamond-dag.png" align="middle" alt="Black Diamond DAG"></td></tr></table></div></div>
</div>
<p><br class="figure-break"></p>
<p>Planning augments the original abstract workflow with ancillary
    tasks to facilitate the proper execution of the workflow. These tasks
    include:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>the creation of remote working directories. These directories
        typically have a name that seeks to avoid conflicts with other
        simultaneously running similar workflows. Such tasks use a job prefix
        of <code class="code">create_dir</code>.</p></li>
<li class="listitem"><p>the stage-in of input files before any task which requires these
        files. Any file consumed by a task needs to be staged to the task, if
        it does not already exist on that site. Such tasks use a job prefix of
        <code class="code">stage_in</code>. If multiple files from various sources need to
        be transferred, multiple stage-in jobs will be created. Additional
        advanced options let you control the size and number of these jobs,
        and whether multiple compute tasks can share stage-in jobs.</p></li>
<li class="listitem"><p>the original DAX job is concretized into a compute task in the
        DAG. Compute job names are a concatenation of the job's <span class="bold"><strong>name</strong></span> and <span class="bold"><strong>id</strong></span>
        attributes from the DAX file.</p></li>
<li class="listitem"><p>the stage-out of data products to a collecting site. Data
        products with their <span class="bold"><strong>transfer</strong></span> flag set
        to <code class="literal">false</code> will not be staged to the output site.
        However, they may still be eligible for staging to other, dependent
        tasks. Stage-out tasks use a job prefix of
        <code class="code">stage_out</code>.</p></li>
<li class="listitem"><p>If compute jobs run at different sites, an intermediary staging
        task with prefix <code class="code">stage_inter</code> is inserted between the
        compute jobs in the workflow, ensuring that the data products of the
        parent are available to the child job.</p></li>
<li class="listitem"><p>the registration of data products in a replica catalog. Data
        products with their <span class="bold"><strong>register</strong></span> flag set
        to <code class="literal">false</code> will not be registered.</p></li>
<li class="listitem"><p>the clean-up of transient files and working directories. These
        steps can be omitted with the <span class="command"><strong>--nocleanup</strong></span> option
        to the planner.</p></li>
</ul></div>
<p>The <a class="link" href="reference.php" title="Chapter 10. Reference Manual">"Reference Manual"</a> chapter
    gives more detail about when and how staging nodes are inserted into the
    workflow.</p>
<p>The DAG will be found in file <code class="filename">diamond-0.dag</code>,
    constructed from the <span class="bold"><strong>name</strong></span> and <span class="bold"><strong>index</strong></span> attributes found in the root element of the
    DAX file.</p>
<pre class="programlisting">######################################################################
# PEGASUS WMS GENERATED DAG FILE
# DAG diamond
# Index = 0, Count = 1
######################################################################

JOB create_dir_diamond_0_hpcc create_dir_diamond_0_hpcc.sub
SCRIPT POST create_dir_diamond_0_hpcc /opt/pegasus/default/bin/pegasus-exitcode create_dir_diamond_0_hpcc.out

JOB stage_in_local_hpcc_0 stage_in_local_hpcc_0.sub
SCRIPT POST stage_in_local_hpcc_0 /opt/pegasus/default/bin/pegasus-exitcode stage_in_local_hpcc_0.out

JOB preprocess_ID000001 preprocess_ID000001.sub
SCRIPT POST preprocess_ID000001 /opt/pegasus/default/bin/pegasus-exitcode preprocess_ID000001.out

JOB findrange_ID000002 findrange_ID000002.sub
SCRIPT POST findrange_ID000002 /opt/pegasus/default/bin/pegasus-exitcode findrange_ID000002.out

JOB findrange_ID000003 findrange_ID000003.sub
SCRIPT POST findrange_ID000003 /opt/pegasus/default/bin/pegasus-exitcode findrange_ID000003.out

JOB analyze_ID000004 analyze_ID000004.sub
SCRIPT POST analyze_ID000004 /opt/pegasus/default/bin/pegasus-exitcode analyze_ID000004.out

JOB stage_out_local_hpcc_2_0 stage_out_local_hpcc_2_0.sub
SCRIPT POST stage_out_local_hpcc_2_0 /opt/pegasus/default/bin/pegasus-exitcode stage_out_local_hpcc_2_0.out

PARENT findrange_ID000002 CHILD analyze_ID000004
PARENT findrange_ID000003 CHILD analyze_ID000004
PARENT preprocess_ID000001 CHILD findrange_ID000002
PARENT preprocess_ID000001 CHILD findrange_ID000003
PARENT analyze_ID000004 CHILD stage_out_local_hpcc_2_0
PARENT stage_in_local_hpcc_0 CHILD preprocess_ID000001
PARENT create_dir_diamond_0_hpcc CHILD findrange_ID000002
PARENT create_dir_diamond_0_hpcc CHILD findrange_ID000003
PARENT create_dir_diamond_0_hpcc CHILD preprocess_ID000001
PARENT create_dir_diamond_0_hpcc CHILD analyze_ID000004
PARENT create_dir_diamond_0_hpcc CHILD stage_in_local_hpcc_0
######################################################################
# End of DAG
######################################################################
</pre>
<p>The DAG file declares all jobs and links each one to a Condor submit
    file that describes the planned, concrete job. The same directory as
    the DAG file contains the Condor submit files for all jobs shown in the
    figure, plus a number of additional helper files.</p>
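<p>To make the structure of the DAG file concrete, the following Python
    sketch reads the <code class="code">JOB</code> and <code class="code">PARENT ... CHILD</code> lines shown above and
    builds a job-to-submit-file map and a parent-to-children map. The function
    name <code class="code">parse_dag</code> is hypothetical and not part of Pegasus or Condor; the
    sketch handles only these two instructions and ignores everything else,
    including <code class="code">SCRIPT POST</code>.</p>

```python
from collections import defaultdict


def parse_dag(text):
    """Return (jobs, children) parsed from DAG file text.

    jobs maps a job name to its submit file; children maps a parent
    job name to the list of its child job names. Hypothetical helper,
    covering only JOB and PARENT ... CHILD lines.
    """
    jobs = {}
    children = defaultdict(list)
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if tokens[0] == "JOB" and len(tokens) >= 3:
            jobs[tokens[1]] = tokens[2]
        elif tokens[0] == "PARENT" and "CHILD" in tokens:
            split = tokens.index("CHILD")
            for parent in tokens[1:split]:
                children[parent].extend(tokens[split + 1:])
    return jobs, dict(children)


# A few lines taken from the listing above.
sample = """\
JOB preprocess_ID000001 preprocess_ID000001.sub
JOB findrange_ID000002 findrange_ID000002.sub
JOB findrange_ID000003 findrange_ID000003.sub
PARENT preprocess_ID000001 CHILD findrange_ID000002
PARENT preprocess_ID000001 CHILD findrange_ID000003
"""

jobs, children = parse_dag(sample)
print(sorted(jobs))
print(children["preprocess_ID000001"])
```

<p>A map like this is enough to check, for example, that every job named
    in a dependency line also has a <code class="code">JOB</code> declaration.</p>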
<p>The various instructions that can be put into a DAG file are
    described in <a class="ulink" href="http://www.cs.wisc.edu/condor/manual/v7.5/2_10DAGMan_Applications.html" target="_top">Condor's
    DAGMan documentation</a>. The constituents of the submit directory are
    described in the <a class="link" href="submit_directory.php" title="Chapter 7. Submit Directory Details">"Submit Directory
    Details"</a> chapter.</p>
</div>
<div class="section" title="5.2. Mapping Refinement Steps">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="mapping_refinement_steps"></a>5.2. Mapping Refinement Steps</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="running_workflows.php#idp12371520">5.2.1. Data Reuse</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#idp9994176">5.2.2. Site Selection</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#idp20639312">5.2.3. Job Clustering</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#idp20629216">5.2.4. Addition of Data Transfer and
      Registration Nodes</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#idp14322992">5.2.5. Addition of Create Dir and Cleanup Jobs</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#idp18085056">5.2.6. Code Generation</a></span></dt>
</dl></div>
<p>During the mapping process, the abstract workflow undergoes a series
    of refinement steps that converts it to an executable form.</p>
<div class="section" title="5.2.1. Data Reuse">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp12371520"></a>5.2.1. Data Reuse</h3></div></div></div>
<p>The abstract workflow after parsing is optionally handed over to
      the Data Reuse Module. The Data Reuse Algorithm in Pegasus attempts to
      prune all the nodes in the abstract workflow for which the output files
      exist in the Replica Catalog. It also attempts to cascade the deletion
      to the parents of the deleted node. For example, if the output files of
      the leaf nodes already exist in the Replica Catalog, Pegasus will prune
      the entire workflow, because the output files the user is interested in
      are already available.</p>
<p>The Data Reuse Algorithm works in two passes:</p>
<p><span class="bold"><strong>First Pass</strong></span> - Determine all the
      jobs whose output files exist in the Replica Catalog. An output file
      with the transfer flag set to false is treated as equivalent to a file
      existing in the Replica Catalog if it is not an input to
      any of the children of the job that produces it.</p>
<p><span class="bold"><strong>Second Pass</strong></span> - The algorithm
      removes the jobs whose output files exist in the Replica Catalog and
      tries to cascade the deletion upwards to the parent jobs. It performs a
      bottom-up, breadth-first traversal of the workflow and deletes a node
      if the following condition holds:</p>
<pre class="programlisting">( It is already marked for deletion in Pass 1
     OR
      ( ALL of its children have been marked for deletion
        AND
        Node's output files have transfer flags set to false
       )
 )</pre>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>The Data Reuse Algorithm can be disabled by passing the
        <span class="bold"><strong>--force</strong></span> option to
        pegasus-plan.</p>
</div>
<div class="figure">
<a name="idp13179616"></a><p class="title"><b>Figure 5.2. Workflow Data Reuse</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-data-reuse.png" align="middle" alt="Workflow Data Reuse"></div></div>
</div>
<br class="figure-break">
</div>
<div class="section" title="5.2.2. Site Selection">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp9994176"></a>5.2.2. Site Selection</h3></div></div></div>
<p>The abstract workflow is then handed over to the Site Selector
      module, where the abstract jobs in the pruned workflow are mapped to the
      various sites passed by the user. The target sites for planning are
      specified on the command line using the <span class="bold"><strong>--sites</strong></span> option to pegasus-plan. If it is not specified,
      Pegasus picks up all the sites in the Site Catalog as candidate sites.
      Pegasus will map a compute job to a site only if Pegasus can:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>find an INSTALLED executable on the site</p></li>
<li class="listitem">
<p>OR find a STAGEABLE executable that can be staged to the site
          as part of the workflow execution.</p>
<p>Pegasus supports a variety of site selectors, with Random being
          the default:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="circle">
<li class="listitem">
<p><span class="bold"><strong>Random</strong></span></p>
<p>The jobs will be randomly distributed among the sites that
              can execute them.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>RoundRobin</strong></span></p>
<p>The jobs will be assigned in a round robin manner amongst
              the sites that can execute them. Since each site cannot execute
              every type of job, the round robin scheduling is done per level
              on a sorted list. The sorting is on the basis of the number of
              jobs a particular site has been assigned in that level so far.
              If a job cannot be run on the first site in the queue (due to no
              matching entry in the transformation catalog for the
              transformation referred to by the job), it goes to the next one
              and so on. This implementation defaults to classic round robin
              in the case where all the jobs in the workflow can run on all
              the sites.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Group</strong></span></p>
<p>Groups of jobs will be assigned to the same site that can
              execute them. Using the <span class="bold"><strong>PEGASUS
              profile key group</strong></span> in the DAX associates a job with a
              particular group. Jobs that do not have the profile key
              associated with them are put in the default group. The jobs
              in the default group are handed over to the "Random" Site
              Selector for scheduling.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Heft</strong></span></p>
<p>A version of the HEFT processor scheduling algorithm is
              used to schedule jobs in the workflow to multiple grid sites.
              The implementation assumes default data communication costs when
              jobs are not scheduled on to the same site. Later on this may be
              made more configurable.</p>
<p>The runtime for the jobs is specified in the
              transformation catalog by associating the <span class="bold"><strong>pegasus profile key runtime</strong></span> with the
              entries.</p>
<p>The number of processors in a site is picked up from the
              attribute <span class="bold"><strong>idle-nodes</strong></span> associated
              with the vanilla jobmanager of the site in the site
              catalog.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>NonJavaCallout</strong></span></p>
<p>Pegasus will call out to an external site selector. In this
              mode a temporary file containing the job information is
              prepared and passed to the site selector as an argument when
              it is invoked. The path to the site selector is specified by
              setting the property pegasus.site.selector.path. The environment
              variables that need to be set to run the site selector can be
              specified using properties with a pegasus.site.selector.env.
              prefix. The temporary file describes the job
              that needs to be scheduled as key-value pairs, one pair
              per line, with key and value separated by
              =.</p>
<p>The following pairs are currently generated for the site
              selector temporary file.</p>
<div class="table">
<a name="idp16220544"></a><p class="title"><b>Table 5.1. Key-value pairs currently generated
                for the site selector temporary file in
                NonJavaCallout mode</b></p>
<div class="table-contents"><table summary="Table 1: Key Value Pairs that are currently generated
                for the site selector temporary file that is generated in the
                NonJavaCallout." border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Key</strong></span></td>
<td><span class="bold"><strong>Value</strong></span></td>
</tr>
<tr>
<td>version</td>
<td>is the version of the site selector API, currently
                      2.0.</td>
</tr>
<tr>
<td>transformation</td>
<td>is the fully qualified definition identifier for
                      the transformation (TR), namespace::name:version.</td>
</tr>
<tr>
<td>derivation</td>
<td>is the fully qualified definition identifier for
                      the derivation (DV), namespace::name:version.</td>
</tr>
<tr>
<td>job.level</td>
<td>is the job's depth in the tree of the workflow
                      DAG.</td>
</tr>
<tr>
<td>job.id</td>
<td>is the job's ID, as used in the DAX file.</td>
</tr>
<tr>
<td>resource.id</td>
<td>is a pool handle, followed by whitespace,
                      followed by a gridftp server. Typically, each gridftp
                      server is enumerated once, so you may have multiple
                      occurrences of the same site. There can be multiple
                      occurrences of this key.</td>
</tr>
<tr>
<td>input.lfn</td>
<td>is an input LFN, optionally followed by
                      whitespace and the file size. There can be multiple
                      occurrences of this key, one for each input LFN required
                      by the job.</td>
</tr>
<tr>
<td>wf.name</td>
<td>is the label of the DAX, as found in the DAX's root
                      element.</td>
</tr>
<tr>
<td>wf.index</td>
<td>is the DAX index, which is incremented
                      for each partition in case of deferred planning.</td>
</tr>
<tr>
<td>wf.time</td>
<td>is the mtime of the workflow.</td>
</tr>
<tr>
<td>wf.manager</td>
<td>is the name of the workflow manager being used,
                      e.g. condor.</td>
</tr>
<tr>
<td>vo.name</td>
<td>is the name of the virtual organization that is
                      running this workflow. It is currently set to
                      NONE.</td>
</tr>
<tr>
<td>vo.group</td>
<td>unused at present and is set to NONE.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break">
</li>
</ul></div>
</li>
</ul></div>
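<p>As an illustration of the temporary file passed to an external site
      selector in NonJavaCallout mode, the Python sketch below parses
      key-value lines into a dictionary that keeps every occurrence of a key,
      since <code class="code">resource.id</code> and <code class="code">input.lfn</code> may appear more than once.
      The function name <code class="code">parse_selector_input</code> is hypothetical, and the sample
      values (host names, file size, transformation identifier) are invented
      for the example; only the keys and the line format come from the
      description above.</p>

```python
def parse_selector_input(text):
    """Parse key=value lines into a dict mapping each key to a list of values."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # ignore blank or malformed lines
        key, value = line.split("=", 1)
        pairs.setdefault(key.strip(), []).append(value.strip())
    return pairs


# Invented sample content; real files are written by the planner.
sample = """\
version=2.0
transformation=diamond::findrange:2.0
job.level=2
job.id=ID000002
resource.id=hpcc gsiftp://hpcc.example.org
resource.id=local gsiftp://submit.example.org
input.lfn=f.b1 1024
wf.name=diamond
wf.manager=condor
vo.name=NONE
vo.group=NONE
"""

pairs = parse_selector_input(sample)
# The pool handle is the first whitespace-separated token of resource.id.
sites = [value.split()[0] for value in pairs["resource.id"]]
print(sites)
```

<p>A site selector written this way would pick one of the candidate
      handles in <code class="code">sites</code>; how the choice is reported back to Pegasus is
      not covered by this sketch.</p>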
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>The site selector to use for site selection can be specified by
        setting the property <span class="bold"><strong>pegasus.selector.site</strong></span>.</p>
</div>
<div class="figure">
<a name="idp20635296"></a><p class="title"><b>Figure 5.3. Workflow Site Selection</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-site-selection.png" align="middle" alt="Workflow Site Selection"></div></div>
</div>
<br class="figure-break">
</div>
<div class="section" title="5.2.3. Job Clustering">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp20639312"></a>5.2.3. Job Clustering</h3></div></div></div>
<p>After site selection, the workflow is optionally handed to the
      job clustering module, which clusters jobs that are scheduled to the
      same site. Clustering is usually done on short-running jobs in order to
      reduce the remote execution overheads associated with a job. Clustering
      is described in detail in the <a class="link" href="reference.php" title="Chapter 10. Reference Manual">Reference
      Manual</a> chapter.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Job clustering is turned on by passing the <span class="bold"><strong>--cluster</strong></span> option to pegasus-plan.</p>
</div>
</div>
<div class="section" title="5.2.4. Addition of Data Transfer and Registration Nodes">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp20629216"></a>5.2.4. Addition of Data Transfer and
      Registration Nodes</h3></div></div></div>
<p>After job clustering, the workflow is handed to the Data Transfer
      module, which adds data stage-in, inter-site and stage-out nodes to the
      workflow. Data stage-in nodes transfer input data required by the
      workflow from the locations specified in the Replica Catalog to a
      directory on the staging site associated with the job. The staging site
      for a job is the execution site if running in sharedfs mode; otherwise it
      is the one specified by the <span class="bold"><strong>--staging-site</strong></span>
      option to the planner. If multiple locations are specified for the
      same input file, the location from which to stage the data is selected
      using a <span class="bold"><strong>Replica Selector</strong></span>. Replica
      selection is described in detail in the Replica Selection section of the
      <a class="link" href="reference.php" title="Chapter 10. Reference Manual">Reference Manual</a>. More details about
      the staging site can be found in the <a class="link" href="running_workflows.php#data_staging_configuration" title="5.3. Data Staging Configuration">data staging configuration</a>
      section.</p>
<p>The process of adding the data stage-in and data stage-out nodes
      is handled by Transfer Refiners. All data transfer jobs in Pegasus are
      executed using <span class="bold"><strong>pegasus-transfer</strong></span>. The
      pegasus-transfer client is a Python-based wrapper around various
      transfer clients such as globus-url-copy, s3cmd, irods-transfer, scp, wget,
      cp and ln. It looks at the source and destination URLs and
      automatically determines which underlying client to use. pegasus-transfer is
      distributed with Pegasus and can be found in the bin subdirectory.
      Pegasus Transfer Refiners are described in detail in the
      Transfers section of the <a class="link" href="reference.php" title="Chapter 10. Reference Manual">Reference
      Manual</a>. The default transfer refiner used in Pegasus is
      the <span class="bold"><strong>Bundle</strong></span> Transfer Refiner, which
      bundles data stage-in nodes and data stage-out nodes on the basis of
      certain pegasus profile keys associated with the workflow.</p>
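<p>The idea of choosing a transfer client from the source and destination
      URLs can be sketched as below. This is not the pegasus-transfer code:
      the scheme-to-client table and the function <code class="code">pick_client</code> are
      assumptions made for illustration, and the rules actually applied by
      pegasus-transfer may differ.</p>

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL scheme to a client named in the text above.
CLIENT_BY_SCHEME = {
    "gsiftp": "globus-url-copy",
    "s3": "s3cmd",
    "scp": "scp",
    "http": "wget",
    "https": "wget",
    "ftp": "wget",
    "file": "cp",
}


def pick_client(src_url, dst_url):
    """Pick a client name for a transfer, preferring the remote end's scheme."""
    # A bare path such as /tmp/f has no scheme and is treated as a local file.
    src = urlparse(src_url).scheme or "file"
    dst = urlparse(dst_url).scheme or "file"
    for scheme in (src, dst):
        if scheme != "file":
            if scheme not in CLIENT_BY_SCHEME:
                raise ValueError("no client known for scheme: " + scheme)
            return CLIENT_BY_SCHEME[scheme]
    return CLIENT_BY_SCHEME["file"]


print(pick_client("gsiftp://hpcc.example.org/data/f.a", "file:///scratch/f.a"))
print(pick_client("/scratch/f.d", "/archive/f.d"))
```

<p>Keeping the table separate from the lookup makes it easy to see which
      URL schemes are covered and to add new ones.</p>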
<div class="figure">
<a name="idp18187408"></a><p class="title"><b>Figure 5.4. Addition of Data Transfer Nodes to the Workflow</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-transfer-jobs.png" align="middle" alt="Addition of Data Transfer Nodes to the Workflow"></div></div>
</div>
<br class="figure-break"><p>Data Registration Nodes may also be added to the final executable
      workflow to register the location of the output files on the final
      output site back in the Replica Catalog. An output file is registered
      in the Replica Catalog if the register flag for the file is set to true
      in the DAX.</p>
<div class="figure">
<a name="idp11883152"></a><p class="title"><b>Figure 5.5. Addition of Data Registration Nodes to the Workflow</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-registration-jobs.png" align="middle" alt="Addition of Data Registration Nodes to the Workflow"></div></div>
</div>
<br class="figure-break"><p>Data is staged in to, and staged out from, a directory that is created
      on the head node by a create dir job in the workflow. In the vanilla
      case, the directory is visible to all the worker nodes and compute jobs
      are launched in this directory on the shared filesystem. In the case
      where there is no shared filesystem, users can turn on worker node
      execution, where the data is staged from the head node directory to a
      directory on the worker node filesystem. This feature will be refined
      further for Pegasus 3.1. To use it with Pegasus 3.0, send email to
      <span class="bold"><strong>pegasus-support at isi.edu</strong></span>.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>The replica selector to use for replica selection can be
        specified by setting the property <span class="bold"><strong>pegasus.selector.replica</strong></span>.</p>
</div>
</div>
<div class="section" title="5.2.5. Addition of Create Dir and Cleanup Jobs">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp14322992"></a>5.2.5. Addition of Create Dir and Cleanup Jobs</h3></div></div></div>
<p>After the data transfer nodes have been added to the workflow,
      Pegasus adds create dir jobs to the workflow. Pegasus usually
      creates one workflow-specific directory per compute site, on
      the staging site associated with the job. In a shared
      filesystem setup, this is a directory on the shared filesystem of the
      compute site; it is
      visible to all the worker nodes, and it is where the data is staged in
      by the data stage-in jobs.</p>
<p>The staging site for a job is the execution site if running in
      sharedfs mode; otherwise it is the one specified by the <span class="bold"><strong>--staging-site</strong></span> option to the planner. More
      details about the staging site can be found in the <a class="link" href="running_workflows.php#data_staging_configuration" title="5.3. Data Staging Configuration">data staging configuration</a>
      section.</p>
<p>After addition of the create dir jobs, the workflow is optionally
      handed to the cleanup module. The cleanup module adds cleanup nodes to
      the workflow that remove data from the directory on the shared
      filesystem when it is no longer required by the workflow. This is useful
      in reducing the peak storage requirements of the workflow.</p>
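<p>One simple way to decide when a file is no longer required is to make
      its removal wait for every job that reads or writes it. The Python
      sketch below computes, for each file, the jobs a cleanup node would have
      to depend on. It is a simplified illustration, not the algorithm used by
      the Pegasus cleanup module: the function name <code class="code">cleanup_parents</code> is
      hypothetical, and stage-out and registration of final outputs are
      ignored.</p>

```python
def cleanup_parents(jobs):
    """Map each file to the sorted list of jobs that must finish before it is removed.

    jobs maps a job name to a dict with "inputs" and "outputs" sets of
    logical file names.
    """
    users = {}
    for name, job in jobs.items():
        for lfn in job["inputs"] | job["outputs"]:
            users.setdefault(lfn, set()).add(name)
    return {lfn: sorted(names) for lfn, names in users.items()}


# Diamond workflow with job names shortened for readability.
diamond = {
    "preprocess": {"inputs": {"f.a"}, "outputs": {"f.b1", "f.b2"}},
    "findrange_1": {"inputs": {"f.b1"}, "outputs": {"f.c1"}},
    "findrange_2": {"inputs": {"f.b2"}, "outputs": {"f.c2"}},
    "analyze": {"inputs": {"f.c1", "f.c2"}, "outputs": {"f.d"}},
}

plan = cleanup_parents(diamond)
print(plan["f.b1"])
print(plan["f.a"])
```

<p>In this example <code class="code">f.b1</code> can be removed as soon as
      <code class="code">preprocess</code> and <code class="code">findrange_1</code> have finished, well before
      <code class="code">analyze</code> runs, which is how cleanup lowers peak storage use.</p>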
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>The addition of the cleanup nodes to the workflow can be
        disabled by passing the <span class="bold"><strong>--nocleanup</strong></span>
        option to pegasus-plan.</p>
</div>
<div class="figure">
<a name="idp18376784"></a><p class="title"><b>Figure 5.6. Addition of Directory Creation and File Removal Jobs</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-creadir-rm-jobs.png" align="middle" alt="Addition of Directory Creation and File Removal Jobs"></div></div>
</div>
<br class="figure-break"><div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Users can specify the maximum number of cleanup jobs added per
        level by specifying the property <span class="bold"><strong>pegasus.file.cleanup.clusters.num</strong></span> in the
        properties.</p>
</div>
</div>
<div class="section" title="5.2.6. Code Generation">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp18085056"></a>5.2.6. Code Generation</h3></div></div></div>
<p>The last step of the refinement process is code generation, where
      Pegasus writes out the executable workflow in a form understandable by
      the underlying workflow executor. At present Pegasus supports the
      following code generators:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p><span class="bold"><strong>Condor</strong></span></p>
<p>This is the default code generator for Pegasus. This
          generator generates the executable workflow as a Condor DAG file and
          associated job submit files. The Condor DAG file is passed as input
          to Condor DAGMan for job execution.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Shell</strong></span></p>
<p>This code generator generates the executable workflow as a
          shell script that can be executed on the submit host. When using
          this code generator, all the jobs should be mapped to the site local, i.e.
          specify <span class="bold"><strong>--sites local</strong></span> to
          pegasus-plan.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>To use the Shell code generator, set the property <span class="bold"><strong>pegasus.code.generator</strong></span> to Shell.</p>
</div>
</li>
</ol></div>
<div class="figure">
<a name="idp16199936"></a><p class="title"><b>Figure 5.7. Final Executable Workflow</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-final-executable-wf.png" align="middle" alt="Final Executable Workflow"></div></div>
</div>
<br class="figure-break">
</div>
</div>
<div class="section" title="5.3. Data Staging Configuration">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="data_staging_configuration"></a>5.3. Data Staging Configuration</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="running_workflows.php#idp18191648">5.3.1. Shared File System</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#non_shared_fs">5.3.2. Non Shared Filesystem</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#idp16310800">5.3.3. Condor Pool Without a Shared Filesystem</a></span></dt>
</dl></div>
<p>Broadly, Pegasus can be set up to run workflows in the following
    configurations:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p><span class="bold"><strong>Shared File System</strong></span></p>
<p>This setup applies where the head node and the worker nodes
        of a cluster share a filesystem. Compute jobs in the workflow run in a
        directory on the shared filesystem.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>NonShared FileSystem</strong></span></p>
<p>This setup applies where the head node and the worker nodes
        of a cluster don't share a filesystem. Compute jobs in the workflow
        run in a local directory on the worker node.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Condor Pool Without a shared
        filesystem</strong></span></p>
<p>This setup applies to a Condor pool whose worker nodes
        don't share a filesystem. All data IO is
        achieved using Condor File IO. This is a special case of the non
        shared filesystem setup, where instead of using pegasus-transfer to
        transfer input and output data, Condor File IO is used.</p>
</li>
</ul></div>
<p>For the purposes of data configuration, the various sites and
    directories are defined below.</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p><span class="bold"><strong>Submit Host</strong></span></p>
<p>The host from which the workflows are submitted. This is where
        Pegasus and Condor DAGMan are installed. This is referred to as the
        <span class="bold"><strong>"local"</strong></span> site in the site
        catalog.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Compute Site</strong></span></p>
<p>The site where the jobs mentioned in the DAX are executed. There
        needs to be an entry in the Site Catalog for every compute site. The
        compute site is passed to pegasus-plan using the <span class="bold"><strong>--sites</strong></span> option.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Staging Site</strong></span></p>
<p>The site to which the separate transfer jobs in the executable
        workflow (jobs with stage_in, stage_out and stage_inter prefixes
        that Pegasus adds using the transfer refiners) stage the input data,
        and from which they transfer the output data to the final output site.
        Currently, the staging site is always the compute site where the jobs
        execute.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Output Site</strong></span></p>
<p>The output site is the final storage site where users want
        the output data from jobs to go. The output site is passed to
        pegasus-plan using the <span class="bold"><strong>--output</strong></span>
        option. The stage-out jobs in the workflow stage the data from the
        staging site to the final storage site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Input Site</strong></span></p>
<p>The site where the input data is stored. The locations of the
        input data are catalogued in the Replica Catalog, and the pool
        attribute of the locations gives us the site handle for the input
        site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Workflow Execution
        Directory</strong></span></p>
<p>This is the directory created by the create dir jobs in the
        executable workflow on the Staging Site. This is a directory per
        workflow per staging site. Currently, the Staging site is always the
        Compute Site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Worker Node Directory</strong></span></p>
<p>This is the directory created on the worker nodes per job,
        usually by the job wrapper that launches the job.</p>
</li>
</ol></div>
<div class="section" title="5.3.1. Shared File System">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp18191648"></a>5.3.1. Shared File System</h3></div></div></div>
<p>By default, Pegasus is set up to run workflows in the shared file
      system setup, where the worker nodes and the head node of a cluster
      share a filesystem.</p>
<div class="figure">
<a name="idp18192976"></a><p class="title"><b>Figure 5.8. Shared File System Setup</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="images/data-configuration-sharedfs.png" align="middle" alt="Shared File System Setup"></div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stage-in job executes (either on the submit host or on the head node) to
          stage in input data from the input sites (1 to n) to a workflow-specific
          execution directory on the shared filesystem.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in the workflow execution
          directory and accesses the input data using POSIX IO.</p></li>
<li class="listitem"><p>The compute job executes on the worker node and writes its output
          data to the workflow execution directory using POSIX IO.</p></li>
<li class="listitem"><p>The stage-out job executes (either on the submit host or on the head node)
          to stage out output data from the workflow-specific execution
          directory to a directory on the final output site.</p></li>
</ol></div>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span>
        to <span class="bold"><strong>sharedfs</strong></span> to run in this
        configuration.</p>
</div>
</div>
<div class="section" title="5.3.2. Non Shared Filesystem">
<div class="titlepage"><div><div><h3 class="title">
<a name="non_shared_fs"></a>5.3.2. Non Shared Filesystem</h3></div></div></div>
<p>In this setup, Pegasus runs workflows on the local filesystems of
      worker nodes that do not share a filesystem. The
      data transfers happen between the worker node and a staging / data
      coordination site. The staging site server can be a file server on the
      head node of a cluster or can be on a separate machine.</p>
<p><span class="bold"><strong>Setup</strong></span> </p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>The compute site and the staging site are different.</p></li>
<li class="listitem"><p>The head node and the worker nodes of the compute site don't share a
            filesystem.</p></li>
<li class="listitem"><p>Input data is staged from remote sites.</p></li>
<li class="listitem"><p>The output site is remote, i.e. a site other than the compute site. It can be
            the submit host.</p></li>
</ul></div>
<div class="figure">
<a name="idp16683808"></a><p class="title"><b>Figure 5.9. Non Shared Filesystem Setup</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="images/data-configuration-nonsharedfs.png" align="middle" alt="Non Shared Filesystem Setup"></div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stage-in job executes (either on the submit host or on the staging
          site) to stage in input data from the input sites (1 to n) to a
          workflow-specific execution directory on the staging site.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in a local execution
          directory. It uses pegasus-transfer to
          transfer the input data from the staging site to a local directory on the
          worker node.</p></li>
<li class="listitem"><p>The compute job executes on the worker node in that local
          directory.</p></li>
<li class="listitem"><p>The compute job writes its output data to the local directory
          on the worker node using POSIX IO.</p></li>
<li class="listitem"><p>The output data is pushed from the worker
          node to the staging site using pegasus-transfer.</p></li>
<li class="listitem"><p>The stage-out job executes (either on the submit host or on the staging
          site) to stage out output data from the workflow-specific execution
          directory to a directory on the final output site.</p></li>
</ol></div>
<p>In this case, the compute jobs are wrapped as <a class="link" href="running_workflows.php#pegasuslite" title="5.4. PegasusLite">PegasusLite</a> instances.</p>
<p>This mode is especially useful for running in cloud
      environments where you don't want to set up a shared filesystem between
      the worker nodes. Running in this mode is explained in detail <a class="link" href="execution_environments.php#amazon_aws" title="6.3.1. Amazon EC2">here</a>.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span>
        to <span class="bold"><strong>nonsharedfs</strong></span> to run in this
        configuration. The staging site can be specified using the <span class="bold"><strong>--staging-site</strong></span> option to pegasus-plan.</p>
</div>
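<p>As an illustration of the tip above, a non shared filesystem run might be
      configured as sketched below. The property name and the --staging-site
      option are taken from this section; the DAX file name workflow.dax and the
      site handles condorpool and staging are hypothetical placeholders that
      would have to match entries in your own site catalog.</p>
<pre class="screen">
# in the user properties file
pegasus.data.configuration  nonsharedfs

# on the command line (hypothetical DAX file and site handles)
pegasus-plan --dax workflow.dax --sites condorpool --staging-site staging --output local
</pre>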
</div>
<div class="section" title="5.3.3. Condor Pool Without a Shared Filesystem">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp16310800"></a>5.3.3. Condor Pool Without a Shared Filesystem</h3></div></div></div>
<p>This setup applies to a Condor pool whose worker nodes
      don't share a filesystem. All data IO is achieved using
      Condor File IO. This is a special case of the non shared filesystem
      setup, where instead of using pegasus-transfer to transfer input and
      output data, Condor File IO is used.</p>
<p><span class="bold"><strong>Setup</strong></span> </p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>The submit host and the staging site are the same.</p></li>
<li class="listitem"><p>The head node and the worker nodes of the compute site don't share a
            filesystem.</p></li>
<li class="listitem"><p>Input data is staged from remote sites.</p></li>
<li class="listitem"><p>The output site is remote, i.e. a site other than the compute site. It can be
            the submit host.</p></li>
</ul></div>
<div class="figure">
<a name="idp16778016"></a><p class="title"><b>Figure 5.10. Condor Pool Without a Shared Filesystem</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="images/data-configuration-condorio.png" align="middle" alt="Condor Pool Without a Shared Filesystem"></div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stage-in job executes on the submit host to stage in input data
          from the input sites (1 to n) to a workflow-specific execution directory
          on the submit host.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in a local execution
          directory. Before the compute job starts, Condor transfers the input
          data for the job from the workflow execution directory on the submit
          host to the local execution directory on the worker node.</p></li>
<li class="listitem"><p>The compute job executes on the worker node in that local
          directory.</p></li>
<li class="listitem"><p>The compute job writes its output data to the local directory
          on the worker node using POSIX IO.</p></li>
<li class="listitem"><p>When the compute job finishes, Condor transfers the output
          data for the job from the local execution directory on the worker
          node to the workflow execution directory on the submit host.</p></li>
<li class="listitem"><p>The stage-out job executes (either on the submit host or on the staging
          site) to stage out output data from the workflow-specific execution
          directory to a directory on the final output site.</p></li>
</ol></div>
<p>In this case, the compute jobs are wrapped as <a class="link" href="running_workflows.php#pegasuslite" title="5.4. PegasusLite">PegasusLite</a> instances.</p>
<p>This mode is especially useful for running in cloud
      environments where you don't want to set up a shared filesystem between
      the worker nodes. Running in this mode is explained in detail <a class="link" href="execution_environments.php#amazon_aws" title="6.3.1. Amazon EC2">here</a>.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span>
        to <span class="bold"><strong>condorio</strong></span> to run in this
        configuration. In this mode, the staging site is automatically set to
        site <span class="bold"><strong>local</strong></span>.</p>
</div>
</div>
</div>
<div class="section" title="5.4. PegasusLite">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="pegasuslite"></a>5.4. PegasusLite</h2></div></div></div>
<p>Starting with Pegasus 4.0, all compute jobs (single or clustered jobs)
    that are executed in a non shared filesystem setup are executed using a
    lightweight job wrapper called PegasusLite.</p>
<div class="figure">
<a name="idp17546416"></a><p class="title"><b>Figure 5.11. Workflow Running in NonShared Filesystem Setup with PegasusLite
      launching compute jobs</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="images/data-configuration-pegasuslite.png" align="middle" alt="Workflow Running in NonShared Filesystem Setup with PegasusLite launching compute jobs"></div></div>
</div>
<br class="figure-break"><p>When PegasusLite starts on a remote worker node to run a compute job,
    it performs the following actions:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>Discovers the best run-time directory based on space
        requirements and creates the directory on the local filesystem of the
        worker node to execute the job.</p></li>
<li class="listitem"><p>Prepares the node for executing the unit of work. This involves
        discovering whether the Pegasus worker tools are already installed on
        the node or need to be brought in.</p></li>
<li class="listitem"><p>Uses pegasus-transfer to stage in the input data to the run-time
        directory (created in step 1) on the remote worker node.</p></li>
<li class="listitem"><p>Launches the compute job.</p></li>
<li class="listitem"><p>Uses pegasus-transfer to stage out the output data to the data
        coordination site.</p></li>
<li class="listitem"><p>Removes the directory created in step 1.</p></li>
</ol></div>
</div>
<div class="section" title="5.5. Pegasus-Plan">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="pegasus-plan"></a>5.5. Pegasus-Plan</h2></div></div></div>
<p>pegasus-plan is the main executable that takes in the abstract
    workflow (DAX) and generates an executable workflow (usually a Condor
    DAG) by querying various catalogs and performing several refinement
    steps. Before users can run pegasus-plan, the following needs to be
    done:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p>Populate the various catalogs</p>
<div class="orderedlist"><ol class="orderedlist" type="a">
<li class="listitem">
<p><span class="bold"><strong>Replica Catalog</strong></span></p>
<p>The Replica Catalog needs to be populated with the
            locations of the input files required by the workflows. This can
            be done by using pegasus-rc-client (see the Replica section of
            <a class="link" href="creating_workflows.php" title="Chapter 4. Creating Workflows">Creating
            Workflows</a>).</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Transformation
            Catalog</strong></span></p>
<p>The Transformation Catalog needs to be populated with the
            locations of the executables that the workflows will use. This can
            be done by using pegasus-tc-client (see the Transformation section
            of <a class="link" href="creating_workflows.php" title="Chapter 4. Creating Workflows">Creating
            Workflows</a>).</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Site Catalog</strong></span></p>
<p>The Site Catalog needs to be populated with the site layout
            of the various sites that the workflows can execute on. A site
            catalog can be generated for OSG by using the client
            pegasus-sc-client (see the Site section of <a class="link" href="creating_workflows.php" title="Chapter 4. Creating Workflows">Creating Workflows</a>).</p>
</li>
</ol></div>
</li>
<li class="listitem">
<p>Configure Properties</p>
<p>After the catalogs have been configured, the user properties
        file needs to be updated with the types and locations of the catalogs
        to use. These properties are described in the <span class="bold"><strong>basic.properties</strong></span> file in the <span class="bold"><strong>etc</strong></span> subdirectory (see the Properties section
        of the <a class="link" href="reference.php" title="Chapter 10. Reference Manual">Reference</a> chapter).</p>
<p>The basic properties that usually need to be set are listed
        below:</p>
<div class="table">
<a name="idp17859248"></a><p class="title"><b>Table 5.2. Basic Properties that need to be set</b></p>
<div class="table-contents"><table summary="Basic Properties that need to be set" border="1">
<colgroup><col></colgroup>
<tbody>
<tr><td>pegasus.catalog.replica</td></tr>
<tr><td>pegasus.catalog.replica.file |
                pegasus.catalog.replica.url</td></tr>
<tr><td>pegasus.catalog.transformation</td></tr>
<tr><td>pegasus.catalog.transformation.file</td></tr>
<tr><td>pegasus.catalog.site.file</td></tr>
</tbody>
</table></div>
</div>
<br class="table-break">
</li>
</ol></div>
<p>To execute pegasus-plan, the user usually needs to specify the
    following options:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p><span class="bold"><strong>--dax </strong></span> the path to the DAX file
        that needs to be mapped.</p></li>
<li class="listitem"><p><span class="bold"><strong>--dir </strong></span> the base directory where
        the executable workflow is generated.</p></li>
<li class="listitem"><p><span class="bold"><strong>--sites </strong></span> a comma-separated list
        of execution sites.</p></li>
<li class="listitem"><p><span class="bold"><strong>--output</strong></span> the output site to which
        the materialized output files are transferred.</p></li>
<li class="listitem"><p><span class="bold"><strong>--submit </strong></span> a boolean flag indicating whether
        to submit the planned workflow for execution after planning is
        done.</p></li>
</ol></div>
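<p>Put together, a typical invocation might look like the sketch below. Only
    the option names are taken from the list above; the DAX path, the submit
    directory name and the site handles are hypothetical and need to be replaced
    with values that are valid for your installation.</p>
<pre class="screen">
pegasus-plan --dax workflow.dax \
             --dir submit \
             --sites condorpool \
             --output local \
             --submit
</pre>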
</div>
<div class="section" title="5.6. Basic Properties">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="BasicProperties"></a>5.6. Basic Properties</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="running_workflows.php#BasicPropertiespegasus.home">5.6.1. pegasus.home</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#BasicPropertiesCatalogProperties">5.6.2. Catalog Properties</a></span></dt>
<dt><span class="section"><a href="running_workflows.php#BasicPropertiesDataStagingConfiguration">5.6.3. Data Staging Configuration</a></span></dt>
</dl></div>
<p></p>
<p>This is the reference guide to the basic properties regarding the
Pegasus Workflow Planner, and their respective default values. Please refer
to the advanced properties guide to know about all the properties that
a user can use to configure the Pegasus Workflow Planner.
Please note that the values rely on proper capitalization, unless explicitly
noted otherwise.
</p>
<p>Some properties derive their default from the value of other
properties. As a notation, ${name} (the property name in curly braces,
preceded by a dollar sign) refers to the value of the
named property. For instance, ${pegasus.home} means that the value depends
on the value of the pegasus.home property plus any noted additions. You
can use this notation to refer to other properties, though the extent
of the substitutions is limited. Usually, you want to refer to a set
of the standard system properties. Nesting is not allowed.
Substitutions will only be done once.
</p>
<p>There is a priority to the order of reading and evaluating properties.
Usually one does not need to worry about the priorities. However, it
is good to know the details of when which property applies, and how
one property is able to override another. The following is a mutually exclusive
list (highest priority first) of property file locations.
</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">The --conf option to the tools. Almost all of the clients that use properties
have a --conf option to specify the property file to pick up.
</li>
<li class="listitem">The submit-dir/pegasus.xxxxxxx.properties file. All tools that work on the
submit directory (i.e. after Pegasus has planned a workflow) pick up the
pegasus.xxxxxxx.properties file from the submit directory. The location of the
pegasus.xxxxxxx.properties file is picked up from the braindump file.
</li>
<li class="listitem">The properties defined in the user property file
<span class="emphasis"><em>${user.home}/.pegasusrc</em></span> have the lowest priority.
</li>
</ol></div>
<p>
</p>
<p>Command-line properties have the highest priority. They override any property loaded
from a property file. Each command-line property is introduced by a -D argument.
Note that these arguments are parsed by the shell wrapper, and thus the -D arguments
must be the first arguments to any command. Command-line properties are useful for debugging
purposes.
</p>
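<p>As a sketch of this usage, the command below overrides the replica catalog
settings for a single planning run. The -Dname=value form is an assumption based
on the usual convention for such arguments, and the file paths and the site
handle are hypothetical. Note that the -D arguments come before all other
options.</p>
<pre class="screen">
pegasus-plan -Dpegasus.catalog.replica=File \
             -Dpegasus.catalog.replica.file=/path/to/rc.data \
             --dax workflow.dax --sites condorpool
</pre>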
<p>From the Pegasus 3.1 release onwards, support has been dropped for the following
properties that were used to signify the location of the properties file:
</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">pegasus.properties</li>
<li class="listitem">pegasus.user.properties</li>
</ul></div>
<p>
</p>
<p>The following example provides a sensible set of properties to be set
in the user property file. These properties use mostly non-default
settings. It is an example only, and will not work for you as is:
</p>
<pre class="screen">
pegasus.catalog.replica              File
pegasus.catalog.replica.file         ${pegasus.home}/etc/sample.rc.data
pegasus.catalog.replica              Regex
pegasus.catalog.replica.file         ${pegasus.home}/etc/sample.rc.data
pegasus.catalog.transformation       Text
pegasus.catalog.transformation.file  ${pegasus.home}/etc/sample.tc.text
pegasus.catalog.site.file            ${pegasus.home}/etc/sample.sites.xml
</pre>
<p>
</p>
<p>If you are in doubt about which properties are actually in effect, note that
during planning Pegasus dumps all properties, after reading and prioritizing
them, into a file with the suffix properties in the submit directory.
</p>
<div class="section" title="5.6.1. pegasus.home">
<div class="titlepage"><div><div><h3 class="title">
<a name="BasicPropertiespegasus.home"></a>5.6.1. pegasus.home</h3></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">Systems:</td>
<td align="left">all</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">directory location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">"$PEGASUS_HOME"</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>The property pegasus.home cannot be set in the property file. This property is
set automatically by the Pegasus clients, which determine the installation
directory of Pegasus internally. Knowledge about this property is important for developers who
want to invoke Pegasus Java classes without the shell wrappers.
</p>
</div>
<div class="section" title="5.6.2. Catalog Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="BasicPropertiesCatalogProperties"></a>5.6.2. Catalog Properties</h3></div></div></div>
<p></p>
<p></p>
<div class="section" title="5.6.2.1. Replica Catalog">
<div class="titlepage"><div><div><h4 class="title">
<a name="BasicPropertiesReplicaCatalog"></a>5.6.2.1. Replica Catalog</h4></div></div></div>
<p></p>
<div class="section" title="5.6.2.1.1. pegasus.catalog.replica">
<div class="titlepage"><div><div><h5 class="title">
<a name="BasicPropertiespegasus.catalog.replica"></a>5.6.2.1.1. pegasus.catalog.replica</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">RLS</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">LRC</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">JDBCRC</td>
</tr>
<tr>
<td align="left">Value[3]:</td>
<td align="left">File</td>
</tr>
<tr>
<td align="left">Value[4]:</td>
<td align="left">Directory</td>
</tr>
<tr>
<td align="left">Value[5]:</td>
<td align="left">MRC</td>
</tr>
<tr>
<td align="left">Value[6]:</td>
<td align="left">Regex</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">RLS</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Pegasus queries a Replica Catalog to discover the physical filenames
(PFN) for input files specified in the DAX. Pegasus can interface
with various types of Replica Catalogs. This property specifies
which type of Replica Catalog to use during the planning process.
</p>
<div class="variablelist"><dl>
<dt><span class="term">RLS</span></dt>
<dd>
RLS (Replica Location Service) is a distributed replica
catalog, which ships with GT4. There is an index service called
Replica Location Index (RLI) to which one or more Local Replica
Catalogs (LRCs) report. Each LRC can contain all or a subset of
the mappings. In this mode, Pegasus queries the central RLI to
discover in which LRCs the mappings for an LFN reside. It then
queries the individual LRCs for the PFNs.
To use RLS, the user additionally needs to set the property
pegasus.catalog.replica.url to specify the URL for the RLI to
query.
Details about RLS can be found at
http://www.globus.org/toolkit/data/rls/
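<p>A minimal sketch of the corresponding properties is shown below. The host
name rli.example.org is a placeholder; the rls:// scheme follows the URLs used
in the MRC example in this section.</p>
<pre class="screen">
pegasus.catalog.replica      RLS
pegasus.catalog.replica.url  rls://rli.example.org
</pre>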
</dd>
<dt><span class="term">LRC</span></dt>
<dd>
Use this mode if the user does not want to query the RLI, but wants to
query a single Local Replica Catalog directly.
To use LRC, the user additionally needs to set the property
pegasus.catalog.replica.url to specify the URL for the LRC to
query.
Details about RLS can be found at
http://www.globus.org/toolkit/data/rls/
</dd>
<dt><span class="term">JDBCRC</span></dt>
<dd>
In this mode, Pegasus queries an SQL based replica catalog that
is accessed via JDBC. The SQL schemas for this catalog can be
found in the $PEGASUS_HOME/sql directory.
To use JDBCRC, the user additionally needs to set the following
properties:
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">pegasus.catalog.replica.db.url</li>
<li class="listitem">pegasus.catalog.replica.db.user</li>
<li class="listitem">pegasus.catalog.replica.db.password</li>
</ol></div>
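<p>A minimal sketch of such a configuration is shown below. The JDBC URL, user
name and password are placeholders only, and depending on the database in use
further settings may be required.</p>
<pre class="screen">
pegasus.catalog.replica              JDBCRC
pegasus.catalog.replica.db.url       jdbc:mysql://localhost/pegasus_rc
pegasus.catalog.replica.db.user      pegasus_user
pegasus.catalog.replica.db.password  changeme
</pre>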
</dd>
<dt><span class="term">File</span></dt>
<dd>
<p>In this mode, Pegasus queries a file based replica catalog.
It is neither transactionally safe, nor advised to use for
production purposes in any way. Multiple concurrent accesses to
the file will end up clobbering the contents of the file. The
site attribute should be specified whenever possible. The attribute
key for the site attribute is "pool".
</p>
<p>The LFN may or may not be quoted. If it contains linear
whitespace, quotes, a backslash or an equality sign, it must be
quoted and escaped. The same applies to the PFN. The attribute key-value
pairs are separated by an equality sign without any
whitespace. The value may be quoted, in which case the quoting rules
for the LFN apply.
</p>
<pre class="screen">
LFN PFN
LFN PFN a=b [..]
LFN PFN a="b" [..]
"LFN w/LWS" "PFN w/LWS" [..]
</pre>
<p>
</p>
<p>To use File, the user additionally needs to set the
pegasus.catalog.replica.file property to the path of the
file based RC.
</p>
</dd>
<dt><span class="term">Regex</span></dt>
<dd>
<p>In this mode, Pegasus queries a file based replica catalog.
It is neither transactionally safe, nor advised to use for
production purposes in any way. Multiple concurrent accesses to
the file will end up clobbering the contents of the file. The
site attribute should be specified whenever possible. The attribute
key for the site attribute is "pool".
</p>
<p>The LFN may or may not be quoted. If it contains linear
whitespace, quotes, a backslash or an equality sign, it must be
quoted and escaped. The same applies to the PFN. The attribute key-value
pairs are separated by an equality sign without any
whitespace. The value may be quoted, in which case the quoting rules
for the LFN apply.
</p>
<p>In addition, users can specify regular expression based LFNs. A regular expression
based entry should be qualified with an attribute named 'regex'. The attribute regex,
when set to true, identifies the catalog entry as a regular expression based entry.
Regular expressions should follow Java regular expression syntax.
</p>
<p>For example, consider a replica catalog as shown below.
</p>
<p>Entry 1 refers to an entry which does not use a regular expression. This entry
would only match a file named 'f.a', and nothing else.
Entry 2 refers to an entry which uses a regular expression. In this entry, f.a
refers to files named f[any-character]a, i.e. faa, f.a, f0a, etc.
</p>
<pre class="screen">
f.a file:///Volumes/data/input/f.a pool="local"
f.a file:///Volumes/data/input/f.a pool="local" regex="true"
</pre>
<p>
</p>
<p>Regular expression based entries also support substitutions. For example,
consider the regular expression based entry shown below.
</p>
<p>Entry 3 will match files named alpha.csv, alpha.txt or alpha.xml.
In addition, values matched in the expression can be used to generate a PFN.
</p>
<p>For the entry below, if the file being looked up is alpha.csv, the PFN for the file
would be generated as file:///Volumes/data/input/csv/alpha.csv. Similarly, if the
file being looked up is alpha.xml, the PFN for the file would be generated as
file:///Volumes/data/input/xml/alpha.xml, i.e. the sections [0] and [1] are replaced.
Section [0] refers to the entire string, e.g. alpha.csv. Section [1] refers to a partial
match in the input, i.e. csv, txt or xml. Users can utilize as many sections as they wish.
</p>
<pre class="screen">
alpha\.(csv|txt|xml) file:///Volumes/data/input/[1]/[0] pool="local" regex="true"
</pre>
<p>
</p>
<p>To use Regex, the user additionally needs to set the
pegasus.catalog.replica.file property to the path of the
file based RC.
</p>
</dd>
<dt><span class="term">Directory</span></dt>
<dd>
<p>In this mode, Pegasus does a directory listing on an input
directory to create the LFN to PFN mappings. The directory listing is
performed recursively, resulting in deep LFN mappings. For example, if an
input directory $input is specified with the following structure:
</p>
<pre class="screen">
$input
$input/f.1
$input/f.2
$input/D1
$input/D1/f.3
</pre>
<p>
Pegasus will create the following LFN to PFN mappings internally:
</p>
<pre class="screen">
f.1 file://$input/f.1  pool="local"
f.2 file://$input/f.2  pool="local"
D1/f.3 file://$input/D1/f.3 pool="local"
</pre>
<p>
</p>
<p>pegasus-plan has an --input-dir option that can be used to specify an input
directory.
</p>
<p>Users can optionally specify additional properties to configure the behavior
of this implementation.
</p>
<p>pegasus.catalog.replica.directory.site specifies a site attribute other than
local to associate with the mappings.
</p>
<p>pegasus.catalog.replica.directory.url.prefix associates a URL prefix with the PFNs
constructed. If not specified, the URL prefix defaults to file://
</p>
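<p>As a sketch, the optional properties above could be combined as follows. The
site handle staging and the URL prefix are made-up values for illustration only;
the input directory itself would still be passed with the --input-dir option of
pegasus-plan.</p>
<pre class="screen">
pegasus.catalog.replica                       Directory
pegasus.catalog.replica.directory.site        staging
pegasus.catalog.replica.directory.url.prefix  gsiftp://staging.example.org
</pre>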
</dd>
<dt><span class="term">MRC</span></dt>
<dd>
<p>In this mode, Pegasus queries multiple replica catalogs to
discover the file locations on the grid. To use it, set:
</p>
<pre class="screen">
pegasus.catalog.replica MRC
</pre>
<p>
</p>
<p>Each associated replica catalog can be configured via properties
as follows.
</p>
<p>The user associates a variable name, referred to as [value], with
each of the catalogs, where [value] is any legal identifier
(concretely [A-Za-z][_A-Za-z0-9]*). For each associated replica
catalog, the user specifies the following properties.
</p>
<pre class="screen">
pegasus.catalog.replica.mrc.[value]       specifies the type of replica catalog.
pegasus.catalog.replica.mrc.[value].key   specifies a property name key for a
                                          particular catalog
</pre>
<p>
</p>
<p>For example, a user who wants to query two LRCs at the same time
can specify the following:
</p>
<pre class="screen">
pegasus.catalog.replica.mrc.lrc1 LRC
pegasus.catalog.replica.mrc.lrc1.url rls://sukhna
pegasus.catalog.replica.mrc.lrc2 LRC
pegasus.catalog.replica.mrc.lrc2.url rls://smarty
</pre>
<p>
</p>
<p>In the above example, lrc1 and lrc2 can be any valid identifier names,
and url is the property key that needs to be specified.
</p>
</dd>
</dl></div>
<p>
</p>
<p></p>
</div>
<div class="section" title="5.6.2.1.2. pegasus.catalog.replica.url">
<div class="titlepage"><div><div><h5 class="title">
<a name="BasicPropertiespegasus.catalog.replica.url"></a>5.6.2.1.2. pegasus.catalog.replica.url</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">URI string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>When using the modern RLS replica catalog, the URI of the replica
catalog must be provided to Pegasus to enable it to look up
filenames. There is no default.
</p>
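<p>A minimal sketch of such a setting is shown below. The host name
rls.example.org is a hypothetical placeholder; the rls:// scheme follows the
URLs used in the MRC example earlier in this section.
</p>
<pre class="screen">
# hypothetical RLS server, for illustration only
pegasus.catalog.replica.url rls://rls.example.org
</pre>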
</div>
</div>
<div class="section" title="5.6.2.2. Site Catalog">
<div class="titlepage"><div><div><h4 class="title">
<a name="BasicPropertiesSiteCatalog"></a>5.6.2.2. Site Catalog</h4></div></div></div>
<p></p>
<p></p>
<div class="section" title="5.6.2.2.1. pegasus.catalog.site.file">
<div class="titlepage"><div><div><h5 class="title">
<a name="BasicPropertiespegasus.catalog.site.file"></a>5.6.2.2.1. pegasus.catalog.site.file</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Site Catalog</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">file location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home.sysconfdir}/sites.xml</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Running workflows on the grid requires an extensive description of the
capabilities of each compute cluster, commonly termed a "site". This
property specifies the location of the file that contains such a site
description. As the format is currently in flux, please refer to the
Pegasus user guide for details on which format is expected.
</p>
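<p>For instance, to point the planner at a site catalog kept outside the
default location, a properties file might contain the line below. The path
is a hypothetical placeholder.
</p>
<pre class="screen">
# hypothetical path, for illustration only
pegasus.catalog.site.file /home/user/workflows/sites.xml
</pre>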
</div>
</div>
<div class="section" title="5.6.2.3. Transformation Catalog">
<div class="titlepage"><div><div><h4 class="title">
<a name="BasicPropertiesTransformationCatalog"></a>5.6.2.3. Transformation Catalog</h4></div></div></div>
<p></p>
<p></p>
<div class="section" title="5.6.2.3.1. pegasus.catalog.transformation">
<div class="titlepage"><div><div><h5 class="title">
<a name="BasicPropertiespegasus.catalog.transformation"></a>5.6.2.3.1. pegasus.catalog.transformation</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Transformation Catalog</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Text</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">File</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Text</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.transformation.file</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<div class="variablelist"><dl>
<dt><span class="term">Text</span></dt>
<dd>
<p>In this mode, a multiline file-based format is understood. The file
is read and cached in memory. Any modification, such as adding or
deleting an entry, updates the in-memory copy and hence the file
underneath. All queries are done against the in-memory
representation.
</p>
<p>The file sample.tc.text in the etc directory contains an example.
</p>
<p>Here is a sample textual format for a transformation catalog containing
one transformation on two sites:
</p>
<pre class="screen">
tr example::keg:1.0 {
    # specify profiles that apply to all the sites for the transformation;
    # in each site entry the profile can be overridden
    profile env "APP_HOME" "/tmp/karan"
    profile env "JAVA_HOME" "/bin/app"
    site isi {
        profile env "me" "with"
        profile condor "more" "test"
        profile env "JAVA_HOME" "/bin/java.1.6"
        pfn "/path/to/keg"
        arch  "x86"
        os    "linux"
        osrelease "fc"
        osversion "4"
        type "INSTALLED"
    }
    site wind {
        profile env "me" "with"
        profile condor "more" "test"
        pfn "/path/to/keg"
        arch  "x86"
        os    "linux"
        osrelease "fc"
        osversion "4"
        type "STAGEABLE"
    }
}
</pre>
<p>
</p>
</dd>
<dt><span class="term">File</span></dt>
<dd>This format is deprecated and will be removed in a future version.
Use pegasus-tc-converter to convert the File format to the Text format.
In this mode, a file format is understood. The file is
read and cached in memory. Any modification, such as adding or
deleting an entry, updates the in-memory copy and hence the file
underneath. All queries are done against the in-memory
representation. The new TC file format uses 6 columns:
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">The resource ID is represented in the first column.</li>
<li class="listitem">The logical transformation uses the colonized format
ns::name:vs.</li>
<li class="listitem">The path to the application on the system.</li>
<li class="listitem">The installation type is identified by one of the following
keywords - all upper case: INSTALLED, STAGEABLE.
If not specified, or <span class="command"><strong>NULL</strong></span> is used, the type
defaults to INSTALLED.</li>
<li class="listitem">The system is of the format ARCH::OS[:VER:GLIBC]. The
following arch types are understood: "INTEL32", "INTEL64",
"SPARCV7", "SPARCV9".
The following os types are understood: "LINUX", "SUNOS",
"AIX". If unset or <span class="command"><strong>NULL</strong></span>, defaults to
INTEL32::LINUX.</li>
<li class="listitem">Profiles are written in the format
NS::KEY=VALUE,KEY2=VALUE;NS2::KEY3=VALUE3.
Multiple key-value pairs for the same namespace are separated by a
comma "," and multiple namespaces are separated by a
semicolon ";". If any of your profile values contains a
comma, you must not use the namespace abbreviator.</li>
</ol></div>
</dd>
</dl></div>
<p>
</p>
<p></p>
</div>
<div class="section" title="5.6.2.3.2. pegasus.catalog.transformation.file">
<div class="titlepage"><div><div><h5 class="title">
<a name="BasicPropertiespegasus.catalog.transformation.file"></a>5.6.2.3.2. pegasus.catalog.transformation.file</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Transformation Catalog</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">file location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home.sysconfdir}/tc.text | ${pegasus.home.sysconfdir}/tc.data</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.transformation</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property sets the path to the textual transformation
catalog of type File or Text. By default, if the transformation catalog is of type Text,
the tc.text file is picked up from sysconfdir; otherwise tc.data is picked up.
</p>
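<p>As a sketch, selecting the Text format together with an explicit catalog
path could look like the following. The path is a hypothetical placeholder.
</p>
<pre class="screen">
# hypothetical path, for illustration only
pegasus.catalog.transformation       Text
pegasus.catalog.transformation.file  /home/user/workflows/tc.text
</pre>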
<p></p>
</div>
</div>
</div>
<div class="section" title="5.6.3. Data Staging Configuration">
<div class="titlepage"><div><div><h3 class="title">
<a name="BasicPropertiesDataStagingConfiguration"></a>5.6.3. Data Staging Configuration</h3></div></div></div>
<p></p>
<div class="section" title="5.6.3.1. pegasus.data.configuration">
<div class="titlepage"><div><div><h4 class="title">
<a name="BasicPropertiespegasus.data.configuration"></a>5.6.3.1. pegasus.data.configuration</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.1</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">sharedfs</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">nonsharedfs</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">condorio</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">sharedfs</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property sets up Pegasus to run in different environments.
</p>
<div class="variablelist"><dl>
<dt><span class="term">sharedfs</span></dt>
<dd>If this is set, Pegasus will be set up to execute jobs on the shared
filesystem on the execution site. This assumes that the head node of a cluster
and the worker nodes share a filesystem. The staging site in this case is
the same as the execution site. Pegasus adds a create dir job to the executable
workflow that creates a workflow-specific directory on the shared filesystem.
The data transfer jobs in the executable workflow ( stage_in_ , stage_inter_ ,
stage_out_ ) transfer the data to this directory. The compute jobs in the
executable workflow are launched in the directory on the shared filesystem.
Internally, if this is set, the following property is set:
<pre class="screen">
pegasus.execute.*.filesystem.local   false
</pre>
</dd>
<dt><span class="term">condorio</span></dt>
<dd>If this is set, Pegasus will be set up to run jobs in a pure Condor pool,
with the nodes not sharing a filesystem. Data is staged to the compute nodes from
the submit host using Condor File IO.
The planner is automatically set up to use the submit host ( site local ) as the
staging site. All the auxiliary jobs added by the planner to the executable
workflow ( create dir, data stage-in and stage-out, cleanup ) refer to
the workflow-specific directory on the local site. The data transfer jobs in
the executable workflow ( stage_in_ , stage_inter_ , stage_out_ ) transfer the
data to this directory. When the compute jobs start, the input data for each
job is shipped from the workflow-specific directory on the submit host to the
compute/worker node using Condor File IO. The output data for each job is
similarly shipped back to the submit host from the compute/worker node.
This setup is particularly helpful when running workflows in a cloud
environment, where setting up a shared filesystem across the VMs may be
tricky.
On loading this property, the following properties are set internally:
<pre class="screen">
pegasus.transfer.sls.*.impl          Condor
pegasus.execute.*.filesystem.local   true
pegasus.gridstart                    PegasusLite
pegasus.transfer.worker.package      true
</pre>
</dd>
<dt><span class="term">nonsharedfs</span></dt>
<dd>If this is set, Pegasus will be set up to execute jobs on an execution site
without relying on a shared filesystem between the head node and the worker nodes.
You can specify a staging site ( using the --staging-site option to pegasus-plan ) to
indicate the site to use as a central storage location for a workflow. The
staging site is independent of the execution sites on which a workflow executes.
All the auxiliary jobs added by the planner to the executable
workflow ( create dir, data stage-in and stage-out, cleanup ) refer to
the workflow-specific directory on the staging site. The data transfer jobs in
the executable workflow ( stage_in_ , stage_inter_ , stage_out_ ) transfer the
data to this directory. When the compute jobs start, the input data for each
job is shipped from the workflow-specific directory on the staging site to the
compute/worker node using pegasus-transfer. The output data for each job is
similarly shipped back to the staging site from the compute/worker node.
The protocols supported at this time are SRM, GridFTP, iRods, and S3.
This setup is particularly helpful when running workflows on OSG, where
most of the execution sites don't have enough data storage. Only a few
sites have large amounts of data storage exposed that can be used to place
data during a workflow run. This setup is also helpful when running workflows
in a cloud environment, where setting up a shared filesystem across the VMs may be
tricky.
On loading this property, the following properties are set internally:
<pre class="screen">
pegasus.execute.*.filesystem.local   true
pegasus.gridstart                    PegasusLite
pegasus.transfer.worker.package      true
</pre>
</dd>
</dl></div>
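<p>As a sketch, a properties file selects one of the three modes with a single
line. Pegasus then applies the internal properties listed above for that mode.
The choice of condorio here is only an example.
</p>
<pre class="screen">
# one of: sharedfs (default), nonsharedfs, condorio
pegasus.data.configuration condorio
</pre>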
<p>
</p>
<p></p>
</div>
</div>
</div>
</div>
<div class="navfooter">
<hr>
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left">
<a accesskey="p" href="creating_workflows.php">Prev</a> </td>
<td width="20%" align="center"> </td>
<td width="40%" align="right"> <a accesskey="n" href="execution_environments.php">Next</a>
</td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Chapter 4. Creating Workflows </td>
<td width="20%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="40%" align="right" valign="top"> Chapter 6. Execution Environments</td>
</tr>
</table>
</div>
</div><?php  
            do_html_footer();
        ?>
