<?php  
            include_once( $_SERVER['DOCUMENT_ROOT']."/static/includes/common.inc.php" );
            do_html_header("Documentation");
        ?><div id="content">
<div class="navheader">
<table width="100%" summary="Navigation header"><tr>
<td width="20%" align="left">
<a accesskey="p" href="ref_output_mapper.php">Prev</a> </td>
<td width="60%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="20%" align="right"> <a accesskey="n" href="optimization.php">Next</a>
</td>
</tr></table>
<hr>
</div>
<div class="section">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="data_cleanup"></a>9.5. Data Cleanup</h2></div></div></div>
<div class="toc"><dl class="toc">
<dt><span class="section"><a href="data_cleanup.php#idp56372096">9.5.1. Data Cleanup in Hierarchical Workflows</a></span></dt>
<dt><span class="section"><a href="data_cleanup.php#idp56376656">9.5.2. Executables used for Directory Creation and Cleanup Jobs</a></span></dt>
</dl></div>
<p>When executing large workflows, users may run out of disk space
    on the remote clusters or the staging site. Pegasus provides a couple of ways
    of enabling automated data cleanup on the staging site (i.e., the scratch
    space used by the workflows). This is achieved by adding data cleanup jobs
    to the executable workflow that the Pegasus Mapper generates. These
    cleanup jobs are responsible for removing files and directories during
    workflow execution. To enable data cleanup, pass the --cleanup
    option to pegasus-plan. The value passed determines the cleanup strategy
    used:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p><span class="bold"><strong>none</strong></span>: disables cleanup
        altogether. The planner does not add any cleanup jobs to the
        executable workflow.</p></li>
<li class="listitem"><p><span class="bold"><strong>leaf</strong></span>: the planner adds one leaf
        cleanup node per staging site, which removes the directory created by
        the create dir job in the workflow.</p></li>
<li class="listitem"><p><span class="bold"><strong>inplace</strong></span>: the mapper adds cleanup
        nodes per level of the workflow in addition to leaf cleanup nodes. These
        nodes remove files that are no longer required during execution. For example,
        an added cleanup node removes the input files of a particular compute
        job after the job has finished successfully. This is the default
        value.</p></li>
</ol></div>
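<p>As a minimal sketch of how the strategy is chosen on the command line,
    the helper below builds a pegasus-plan argument list for one of the three
    values listed above. The function name and the extra_args pass-through are
    hypothetical; only the --cleanup option and its values come from this
    section, and any other planner options your setup needs are left to the
    caller.</p>

```python
import shlex

# Cleanup strategies accepted by the --cleanup option, as listed above.
CLEANUP_STRATEGIES = ("none", "leaf", "inplace")


def build_plan_command(cleanup, extra_args=()):
    """Return the argv list for a pegasus-plan call with the given strategy.

    extra_args is a hypothetical pass-through for whatever other planner
    options your setup needs; none are assumed here.
    """
    if cleanup not in CLEANUP_STRATEGIES:
        raise ValueError("unknown cleanup strategy: %r" % (cleanup,))
    return ["pegasus-plan", "--cleanup", cleanup, *extra_args]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_plan_command("leaf")))
```

<p>Validating the value before invoking the planner keeps a typo such as
    "in-place" from reaching pegasus-plan at all.</p>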
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>For large workflows with many files, the inplace strategy may
      take a long time, because the algorithm works at a per-file level to
      determine when it is safe to remove each file.</p>
</div>
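<p>To give a feel for the per-file bookkeeping involved, here is a toy
    sketch that is not the Pegasus implementation. It assumes the jobs are
    already listed in execution order and ignores stage-out; for every file it
    records the last job that reads or writes it, and marks the file as
    removable once that job has finished. The function name and the tuple
    layout are illustrative assumptions.</p>

```python
def cleanup_points(jobs):
    """Map each job name to the files that can be removed once it finishes.

    jobs is a list of (name, inputs, outputs) tuples in execution order.
    A file becomes removable after the last job that reads or writes it.
    """
    last_use = {}
    for name, inputs, outputs in jobs:
        for f in list(inputs) + list(outputs):
            last_use[f] = name  # later jobs overwrite earlier entries
    removable = {name: [] for name, _, _ in jobs}
    for f, name in last_use.items():
        removable[name].append(f)
    return {name: sorted(files) for name, files in removable.items()}
```

<p>Even in this simplified form, every file reference of every job has to
    be visited, which is why the cost grows with the number of files in the
    workflow.</p>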
<p>The behaviour of the cleanup strategies implemented in the Pegasus
    Mapper can be controlled by the properties described <a class="link" href="properties.php#cleanup_props" title="12.3.13. Cleanup Properties">here</a>.</p>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp56372096"></a>9.5.1. Data Cleanup in Hierarchical Workflows</h3></div></div></div>
<p>By default, inplace cleanup is always turned off for hierarchical
      workflows. This is because the cleanup algorithm (InPlace) does not
      work across sub workflows. For example, suppose you have two DAX jobs in
      your top-level workflow, and the child DAX job refers to a file generated
      during the execution of the parent DAX job. The InPlace cleanup
      algorithm, when applied to the parent DAX job, deletes that file
      while the sub workflow corresponding to the parent DAX job is
      executed. The sub workflow corresponding to the child DAX job then
      fails, because the deleted file is required to be present during
      its execution.</p>
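<p>The failure mode can be reproduced with a toy model, sketched here
      under stated assumptions: the scratch space is a plain Python set, each
      sub workflow is a list of (name, inputs, outputs) tuples in execution
      order, and cleanup looks only at the jobs of the sub workflow it is
      applied to. None of these names come from Pegasus itself.</p>

```python
def run_subworkflow(jobs, scratch, inplace_cleanup):
    """Run jobs in order against a shared scratch set (toy model).

    With inplace_cleanup, a file is removed after the last job in THIS
    sub workflow that touches it; jobs in other sub workflows are unknown.
    """
    last_use = {}
    for name, inputs, outputs in jobs:
        for f in list(inputs) + list(outputs):
            last_use[f] = name
    for name, inputs, outputs in jobs:
        missing = [f for f in inputs if f not in scratch]
        if missing:
            raise FileNotFoundError("%s is missing %s" % (name, missing))
        scratch.update(outputs)
        if inplace_cleanup:
            # Drop every file whose last user in this sub workflow just ran.
            scratch.difference_update(f for f in last_use if last_use[f] == name)


# The parent sub workflow produces f.mid; the child sub workflow consumes it.
parent = [("generate", [], ["f.mid"])]
child = [("consume", ["f.mid"], ["f.out"])]
```

<p>Running parent with inplace_cleanup enabled removes f.mid from the
      scratch set, so a subsequent run of child raises FileNotFoundError.
      With cleanup disabled for parent, child finds its input and
      completes.</p>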
<p>If there are no data dependencies across the DAX jobs, you can
      enable the InPlace algorithm for the sub DAXes. To do this,
      set the property:</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>pegasus.file.cleanup.scope deferred</p></li></ul></div>
<p>With this setting, the cleanup option is picked up from the
      arguments for the DAX job in the top-level DAX.</p>
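<p>As a small illustration, the property can be rendered as a line for a
      properties file. The helper name is hypothetical, and the "key = value"
      layout is the usual Java properties convention, assumed here rather
      than taken from this section.</p>

```python
# Property name and value as given in this section.
CLEANUP_PROPS = {"pegasus.file.cleanup.scope": "deferred"}


def render_properties(props):
    """Render a dict as Java-style properties lines ("key = value")."""
    return "".join("%s = %s\n" % (key, value) for key, value in sorted(props.items()))


if __name__ == "__main__":
    # Print the line rather than writing a file, so nothing is overwritten.
    print(render_properties(CLEANUP_PROPS), end="")
```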
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp56376656"></a>9.5.2. Executables used for Directory Creation and Cleanup Jobs</h3></div></div></div>
<p>Starting with version 4.0, Pegasus changed how scratch
      directories are created on the staging site. The planner now prefers to
      schedule the directory creation and cleanup jobs locally. The jobs refer
      to Python-based tools that call out to protocol-specific clients.
      For protocols where dedicated remote cleanup and directory creation
      clients do not exist (for example, gridftp), the Python tools rely on
      the corresponding transfer tool to create a
      directory by initiating a transfer of an empty file. The Python clients
      used to create directories and remove files are:</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>pegasus-create-dir</p></li>
<li class="listitem"><p>pegasus-cleanup</p></li>
</ul></div>
<p>Both of these clients inspect the URLs to determine which
      underlying client to use.</p>
<div class="table">
<a name="idp56381232"></a><p class="title"><b>Table 9.5. Clients interfaced to by pegasus-create-dir</b></p>
<div class="table-contents"><table summary="Clients interfaced to by pegasus-create-dir" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Client</th>
<th>Used For</th>
</tr></thead>
<tbody>
<tr>
<td>globus-url-copy</td>
<td>to create directories against a gridftp/ftp
              server</td>
</tr>
<tr>
<td>srm-mkdir</td>
<td>to create directories against an SRM server</td>
</tr>
<tr>
<td>mkdir</td>
<td>to create a directory on the local filesystem</td>
</tr>
<tr>
<td>pegasus-s3</td>
<td>to create an S3 bucket in the Amazon cloud</td>
</tr>
<tr>
<td>gsutil</td>
<td>to create a Google Storage bucket</td>
</tr>
<tr>
<td>scp</td>
<td>staging files using scp</td>
</tr>
<tr>
<td>imkdir</td>
<td>to create a directory against an iRODS server</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><div class="table">
<a name="idp55289840"></a><p class="title"><b>Table 9.6. Clients interfaced to by pegasus-cleanup</b></p>
<div class="table-contents"><table summary="Clients interfaced to by pegasus-cleanup" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Client</th>
<th>Used For</th>
</tr></thead>
<tbody>
<tr>
<td>globus-url-copy</td>
<td>to remove a file against a gridftp/ftp server. In this
              case a zero-byte file is created.</td>
</tr>
<tr>
<td>srm-rm</td>
<td>to remove files against an SRM server</td>
</tr>
<tr>
<td>rm</td>
<td>to remove a file on the local filesystem</td>
</tr>
<tr>
<td>pegasus-s3</td>
<td>to remove a file from the S3 bucket</td>
</tr>
<tr>
<td>gsutil</td>
<td>to remove an object from a Google Storage bucket</td>
</tr>
<tr>
<td>scp</td>
<td>to remove a file against an scp server. In this case a
              zero-byte file is created.</td>
</tr>
<tr>
<td>irm</td>
<td>to remove a file against an iRODS server</td>
</tr>
</tbody>
</table></div>
</div>
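<p>The URL-based selection summarized in the two tables can be sketched
      as a lookup on the URL scheme. This is an illustration only, not the
      code of pegasus-create-dir or pegasus-cleanup: the client names are
      taken from Tables 9.5 and 9.6, while the scheme strings (gsiftp, s3, gs,
      irods and so on) and the function name are assumptions.</p>

```python
from urllib.parse import urlparse

# Scheme names are assumptions for illustration; client names come from
# Tables 9.5 and 9.6. Values are (create-dir client, cleanup client).
CLIENTS_BY_SCHEME = {
    "gsiftp": ("globus-url-copy", "globus-url-copy"),
    "ftp": ("globus-url-copy", "globus-url-copy"),
    "srm": ("srm-mkdir", "srm-rm"),
    "file": ("mkdir", "rm"),
    "s3": ("pegasus-s3", "pegasus-s3"),
    "gs": ("gsutil", "gsutil"),
    "scp": ("scp", "scp"),
    "irods": ("imkdir", "irm"),
}


def pick_client(url, action):
    """Return the client name for a URL; action is "create-dir" or "cleanup"."""
    if action not in ("create-dir", "cleanup"):
        raise ValueError("unknown action: %r" % (action,))
    scheme = urlparse(url).scheme.lower()
    if scheme not in CLIENTS_BY_SCHEME:
        raise ValueError("no client known for scheme %r" % (scheme,))
    create_client, cleanup_client = CLIENTS_BY_SCHEME[scheme]
    return create_client if action == "create-dir" else cleanup_client
```

<p>Keeping the mapping in a single table makes it easy to see which
      protocols share one client for both actions and which use a dedicated
      client for each.</p>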
<br class="table-break"><p>The only case in which the create dir and cleanup jobs are scheduled
      to run remotely is when a file server is specified for the staging
      site.</p>
</div>
</div>
<div class="navfooter">
<hr>
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left">
<a accesskey="p" href="ref_output_mapper.php">Prev</a> </td>
<td width="20%" align="center"><a accesskey="u" href="data_management.php">Up</a></td>
<td width="40%" align="right"> <a accesskey="n" href="optimization.php">Next</a>
</td>
</tr>
<tr>
<td width="40%" align="left" valign="top">9.4. Output Mappers </td>
<td width="20%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="40%" align="right" valign="top"> Chapter 10. Optimizing Workflows for Efficiency and Scalability</td>
</tr>
</table>
</div>
</div><?php  
            do_html_footer();
        ?>
