<?php  
            include_once( $_SERVER['DOCUMENT_ROOT']."/static/includes/common.inc.php" );
            do_html_header("Documentation");
        ?><div id="content">
<div class="navheader">
<table width="100%" summary="Navigation header"><tr>
<td width="20%" align="left">
<a accesskey="p" href="mapping_refinement_steps.php">Prev</a> </td>
<td width="60%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="20%" align="right"> <a accesskey="n" href="pegasuslite.php">Next</a>
</td>
</tr></table>
<hr>
</div>
<div class="section">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="data_staging_configuration"></a>5.3. Data Staging Configuration</h2></div></div></div>
<div class="toc"><dl class="toc">
<dt><span class="section"><a href="data_staging_configuration.php#idp61893120">5.3.1. Shared File System</a></span></dt>
<dt><span class="section"><a href="data_staging_configuration.php#non_shared_fs">5.3.2. Non Shared Filesystem</a></span></dt>
<dt><span class="section"><a href="data_staging_configuration.php#idp62139184">5.3.3. Condor Pool Without a Shared Filesystem</a></span></dt>
</dl></div>
<p>Pegasus can be set up to run workflows in one of the following
    configurations:</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem">
<p><span class="bold"><strong>Shared File System</strong></span></p>
<p>This setup applies when the head node and the worker nodes
        of a cluster share a filesystem. Compute jobs in the workflow run in a
        directory on the shared filesystem.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Non Shared Filesystem</strong></span></p>
<p>This setup applies when the head node and the worker nodes
        of a cluster do not share a filesystem. Compute jobs in the workflow
        run in a local directory on the worker node.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Condor Pool Without a Shared
        Filesystem</strong></span></p>
<p>This setup applies to a Condor pool whose worker nodes
        do not share a filesystem. All data IO is
        achieved using Condor File IO. This is a special case of the non
        shared filesystem setup in which Condor File IO, rather than
        pegasus-transfer, is used to transfer input and output data.</p>
</li>
</ul></div>
<p>For the purposes of data configuration, the following sites and
    directories are defined.</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p><span class="bold"><strong>Submit Host</strong></span></p>
<p>The host from which workflows are submitted. This is where
        Pegasus and Condor DAGMan are installed. It is referred to as the
        <span class="bold"><strong>"local"</strong></span> site in the site catalog.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Compute Site</strong></span></p>
<p>The site where the jobs mentioned in the DAX are executed. There
        needs to be an entry in the Site Catalog for every compute site. The
        compute site is passed to pegasus-plan using the <span class="bold"><strong>--sites</strong></span> option.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Staging Site</strong></span></p>
<p>The site that the separate transfer jobs in the executable
        workflow (jobs with stage_in, stage_out and stage_inter prefixes
        that Pegasus adds using the transfer refiners) stage input data to,
        and from which they stage output data to the final output site.
        In the shared file system setup, the staging site is always the
        compute site where the jobs execute.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Output Site</strong></span></p>
<p>The output site is the final storage site where users want
        the output data from jobs to be placed. The output site is passed to
        pegasus-plan using the <span class="bold"><strong>--output</strong></span>
        option. The stageout jobs in the workflow stage the data from the
        staging site to the final storage site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Input Site</strong></span></p>
<p>The site where the input data is stored. The locations of the
        input data are catalogued in the Replica Catalog, and the pool
        attribute of the locations gives us the site handle for the input
        site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Workflow Execution
        Directory</strong></span></p>
<p>This is the directory created by the create dir jobs in the
        executable workflow on the Staging Site. There is one such directory per
        workflow per staging site. In the shared file system setup, the
        Staging Site is always the Compute Site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Worker Node Directory</strong></span></p>
<p>This is the directory created on the worker nodes for each job,
        usually by the job wrapper that launches the job.</p>
</li>
</ol></div>
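<p>As a rough illustration of how these definitions appear in practice,
    the first fragment below sketches a file-based replica catalog entry whose
    pool attribute names the input site, and the second sketches a pegasus-plan
    invocation that names the compute site and the output site. The logical
    file name f.a, its path, the site handles condorpool and local, and the
    file names pegasus.properties and workflow.dax are hypothetical
    placeholders. The --conf, --dax and --dir options are not described in this
    section; they are shown as we understand the 4.x command line, so check
    them against the pegasus-plan help output for your release.</p>
<pre class="programlisting">f.a    file:///data/input/f.a    pool="local"</pre>
<pre class="programlisting">pegasus-plan --conf pegasus.properties \
    --dax workflow.dax \
    --sites condorpool \
    --output local \
    --dir submit</pre>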
<p>You can specify the data configuration to use in either of the following ways:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>properties - Specify the global property <a class="link" href="properties.php#data_conf_props" title="12.3.8. Data Staging Configuration Properties">pegasus.data.configuration</a>.</p></li>
<li class="listitem"><p>site catalog - Starting with the 4.5.0 release, you can specify a pegasus
        profile key named data.configuration and associate it with your
        compute sites in the site catalog.</p></li>
</ol></div>
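<p>A minimal sketch of both approaches follows. The first fragment is a
    line in a properties file and the second is a profile element to place inside a
    compute site's entry in the site catalog. The value nonsharedfs is only an
    example; sharedfs and condorio are set the same way. The XML assumes the
    version 4 XML site catalog format, so adapt it if your catalog uses a
    different schema.</p>
<pre class="programlisting"># pegasus.properties
pegasus.data.configuration = nonsharedfs</pre>
<pre class="programlisting">&lt;profile namespace="pegasus" key="data.configuration"&gt;nonsharedfs&lt;/profile&gt;</pre>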
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp61893120"></a>5.3.1. Shared File System</h3></div></div></div>
<p>By default, Pegasus is set up to run workflows in the shared file
      system configuration, where the worker nodes and the head node of a cluster
      share a filesystem.</p>
<div class="figure">
<a name="idp61894448"></a><p class="title"><b>Figure 5.8. Shared File System Setup</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="images/data-configuration-sharedfs.png" align="middle" alt="Shared File System Setup"></div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stagein job executes (either on the submit host or on the head node) to
          stage in input data from the input sites (1 to n) to a workflow-specific
          execution directory on the shared filesystem.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in the workflow execution
          directory and accesses the input data using POSIX IO.</p></li>
<li class="listitem"><p>The compute job executes on the worker node and writes its output
          data to the workflow execution directory using POSIX IO.</p></li>
<li class="listitem"><p>The stageout job executes (either on the submit host or on the head node)
          to stage out output data from the workflow-specific execution
          directory to a directory on the final output site.</p></li>
</ol></div>
<div class="tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span>
        to <span class="bold"><strong>sharedfs</strong></span> to run in this
        configuration.</p>
</div>
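<p>Because this configuration is the default, no setting is strictly
        required, but it can be made explicit. One way, sketched below, is to
        pass the property on the pegasus-plan command line with a -D option,
        which to our knowledge has to come before the other options. The file
        name workflow.dax and the site handles cluster and local are
        hypothetical, and the --dax option is not described in this section.</p>
<pre class="programlisting">pegasus-plan -Dpegasus.data.configuration=sharedfs \
    --dax workflow.dax \
    --sites cluster \
    --output local</pre>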
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="non_shared_fs"></a>5.3.2. Non Shared Filesystem</h3></div></div></div>
<p>In this setup, Pegasus runs workflows on the local filesystems of
      worker nodes that do not share a filesystem. The
      data transfers happen between the worker node and a staging / data
      coordination site. The staging site server can be a file server on the
      head node of a cluster or can be on a separate machine.</p>
<p><span class="bold"><strong>Setup</strong></span> </p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>The compute site and the staging site are different.</p></li>
<li class="listitem"><p>The head node and worker nodes of the compute site do not share a
            filesystem.</p></li>
<li class="listitem"><p>Input data is staged from remote sites.</p></li>
<li class="listitem"><p>The output site is remote, i.e. a site other than the compute site. It can be
            the submit host.</p></li>
</ul></div>
<div class="figure">
<a name="idp62789312"></a><p class="title"><b>Figure 5.9. Non Shared Filesystem Setup</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="images/data-configuration-nonsharedfs.png" align="middle" alt="Non Shared Filesystem Setup"></div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stagein job executes (either on the submit host or on the staging
          site) to stage in input data from the input sites (1 to n) to a
          workflow-specific execution directory on the staging site.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in a local execution
          directory. It uses pegasus-transfer to
          transfer the input data from the staging site to a local directory on the
          worker node.</p></li>
<li class="listitem"><p>The compute job executes on the worker node.</p></li>
<li class="listitem"><p>The compute job writes its output data to the local directory
          on the worker node using POSIX IO.</p></li>
<li class="listitem"><p>The output data is pushed to the staging site from the worker
          node using pegasus-transfer.</p></li>
<li class="listitem"><p>The stageout job executes (either on the submit host or on the staging
          site) to stage out output data from the workflow-specific execution
          directory to a directory on the final output site.</p></li>
</ol></div>
<p>In this case, the compute jobs are wrapped as <a class="link" href="pegasuslite.php" title="5.4. PegasusLite">PegasusLite</a> instances.</p>
<p>This mode is especially useful for running in cloud
      environments where you don't want to set up a shared filesystem between
      the worker nodes. Running in that mode is explained in detail <a class="link" href="cloud.php#amazon_aws" title="7.3.1. Amazon EC2">here.</a></p>
<div class="tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span>
        to <span class="bold"><strong>nonsharedfs</strong></span> to run in this
        configuration. The staging site can be specified using the <span class="bold"><strong>--staging-site</strong></span> option to pegasus-plan.</p>
</div>
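<p>For example, the sketch below combines the property with a staging
        site chosen at planning time. The site handles condorpool, staging and
        local and the file names pegasus.properties and workflow.dax are
        hypothetical; each handle would need its own entry in the site
        catalog. The --conf and --dax options are not described in this
        section and are shown as we understand the 4.x command line.</p>
<pre class="programlisting"># pegasus.properties is assumed to contain:
#   pegasus.data.configuration = nonsharedfs
pegasus-plan --conf pegasus.properties \
    --dax workflow.dax \
    --sites condorpool \
    --staging-site staging \
    --output local</pre>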
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp62139184"></a>5.3.3. Condor Pool Without a Shared Filesystem</h3></div></div></div>
<p>This setup applies to a Condor pool whose worker nodes
      do not share a filesystem. All data IO is achieved using
      Condor File IO. This is a special case of the non shared filesystem
      setup in which Condor File IO, rather than pegasus-transfer, is used to
      transfer input and output data.</p>
<p><span class="bold"><strong>Setup</strong></span> </p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>The submit host and the staging site are the same.</p></li>
<li class="listitem"><p>The head node and worker nodes of the compute site do not share a
            filesystem.</p></li>
<li class="listitem"><p>Input data is staged from remote sites.</p></li>
<li class="listitem"><p>The output site is remote, i.e. a site other than the compute site. It can be
            the submit host.</p></li>
</ul></div>
<div class="figure">
<a name="idp62442656"></a><p class="title"><b>Figure 5.10. Condor Pool Without a Shared Filesystem</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="images/data-configuration-condorio.png" align="middle" alt="Condor Pool Without a Shared Filesystem"></div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stagein job executes on the submit host to stage in input data
          from the input sites (1 to n) to a workflow-specific execution directory
          on the submit host.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in a local execution
          directory. Before the compute job starts, Condor transfers the input
          data for the job from the workflow execution directory on the submit
          host to the local execution directory on the worker node.</p></li>
<li class="listitem"><p>The compute job executes on the worker node.</p></li>
<li class="listitem"><p>The compute job writes its output data to the local directory
          on the worker node using POSIX IO.</p></li>
<li class="listitem"><p>When the compute job finishes, Condor transfers the output
          data for the job from the local execution directory on the worker
          node to the workflow execution directory on the submit host.</p></li>
<li class="listitem"><p>The stageout job executes (either on the submit host or on the staging
          site) to stage out output data from the workflow-specific execution
          directory to a directory on the final output site.</p></li>
</ol></div>
<p>In this case, the compute jobs are wrapped as <a class="link" href="pegasuslite.php" title="5.4. PegasusLite">PegasusLite</a> instances.</p>
<p>This mode is especially useful for running in cloud
      environments where you don't want to set up a shared filesystem between
      the worker nodes. Running in that mode is explained in detail <a class="link" href="cloud.php#amazon_aws" title="7.3.1. Amazon EC2">here.</a></p>
<div class="tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span>
        to <span class="bold"><strong>condorio</strong></span> to run in this
        configuration. In this mode, the staging site is automatically set to
        site <span class="bold"><strong>local</strong></span>.</p>
</div>
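<p>The configuration can also be attached to the Condor pool itself
        through the pegasus profile key data.configuration in the site
        catalog. The entry below is a sketch that assumes the version 4 XML
        site catalog format. The site handle condorpool and the arch and os
        values are hypothetical, and the style and universe profiles are
        settings commonly used for a Condor pool that are not taken from this
        section.</p>
<pre class="programlisting">&lt;site handle="condorpool" arch="x86_64" os="LINUX"&gt;
    &lt;profile namespace="pegasus" key="style"&gt;condor&lt;/profile&gt;
    &lt;profile namespace="condor" key="universe"&gt;vanilla&lt;/profile&gt;
    &lt;profile namespace="pegasus" key="data.configuration"&gt;condorio&lt;/profile&gt;
&lt;/site&gt;</pre>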
</div>
</div>
<div class="navfooter">
<hr>
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left">
<a accesskey="p" href="mapping_refinement_steps.php">Prev</a> </td>
<td width="20%" align="center"><a accesskey="u" href="running_workflows.php">Up</a></td>
<td width="40%" align="right"> <a accesskey="n" href="pegasuslite.php">Next</a>
</td>
</tr>
<tr>
<td width="40%" align="left" valign="top">5.2. Mapping Refinement Steps </td>
<td width="20%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="40%" align="right" valign="top"> 5.4. PegasusLite</td>
</tr>
</table>
</div>
</div><?php  
            do_html_footer();
        ?>
