A short-term option that just occurred to me would be to run a sort of post-job action for output datasets, deleting any intermediate parent datasets that are no longer needed (and are not outputs themselves).
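As a rough illustration of that idea (all names here are hypothetical stand-ins, not Galaxy's actual model or API), such a post-job action could walk the provenance graph upward from each workflow output and purge every ancestor that is not itself an output:

```python
# Hypothetical sketch: purge intermediate parents of workflow outputs.
# Dataset, parents, and the purged flag are stand-ins for Galaxy's real model.

class Dataset:
    def __init__(self, name, parents=(), is_output=False):
        self.name = name
        self.parents = list(parents)   # datasets this one was derived from
        self.is_output = is_output     # marked as a workflow output
        self.purged = False

def purge_intermediates(outputs):
    """Purge every ancestor of the outputs that is not itself an output."""
    seen = set()
    stack = [p for o in outputs for p in o.parents]
    purged = []
    while stack:
        ds = stack.pop()
        if id(ds) in seen:
            continue
        seen.add(id(ds))
        if not ds.is_output and not ds.purged:
            ds.purged = True           # in Galaxy this would free disk space
            purged.append(ds.name)
        stack.extend(ds.parents)
    return purged
```

For example, with a chain raw -> trimmed -> result where only `result` is marked as an output, purging from `[result]` would mark both `trimmed` and `raw` as purged while leaving `result` untouched.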


On Wed, Nov 13, 2013 at 11:59 AM, John Chilton <chilton@msi.umn.edu> wrote:
On Wed, Nov 13, 2013 at 10:34 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
> On Tue, Nov 12, 2013 at 7:13 PM, Ben Gift <corn8bit2@gmail.com> wrote:
>> I'm working with a lot of data on a cluster (condor). If I save all the
>> workflow intermediate data, as Galaxy does by default (and rightfully so),
>> it fills the drives.
>>
>> How can I tell Galaxy to use /tmp/ to store all intermediate data in a
>> workflow, and keep only the result?
>
> You can't - for a start /tmp is usually machine specific so the /tmp
> used by one cluster node is probably not going to be available
> on the /tmp of the other cluster nodes, and different stages of
> the workflow are likely to be run on different cluster nodes.
>
>> I imagine I'll have to work on how Galaxy handles jobs, but I'm
>> hoping there is something built in for this.
>
> Workflows can mark the output datasets, and the rest are
> automatically hidden/deleted on successful completion
> (but kept and visible on request via the history menu).
>
> It might be nice if we could make that more aggressive and
> actually purge the intermediate files from disk as well?

The ability to have these datasets actually deleted is not available
yet, but it should be an option. Here is the most relevant Trello card:

https://trello.com/c/YfLGkJKe

Even this small step will probably require tracking some concept of a
running workflow in the database or a message queue. I don't think
this is being done currently, but I believe Dannon is working on the
queue piece.

Once that is in place, there are still many things that could be done
better in this arena. Nate has mentioned building functionality into
object stores and job planning so that data could be pre-staged where
it needs to be ahead of time in a workflow.

Along similar lines, one could also imagine implementing/configuring
an object store that simply wrote files that are pre-marked for
deletion (once that is implemented) to faster staging/scratch disk on
the cluster. Having this advance planning logic built in is probably a
prerequisite to allowing the use of named pipes or in-memory data
files some day.
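To make the object-store idea concrete (purely illustrative; this is not Galaxy's real object store interface, and the directory names are made up), such a store would just route pre-marked datasets to a different backing path:

```python
import os

# Illustrative only: a toy tiered store that places short-lived datasets
# on fast scratch space and everything else on permanent storage.
class TieredStore:
    def __init__(self, permanent_dir, scratch_dir):
        self.permanent_dir = permanent_dir
        self.scratch_dir = scratch_dir

    def path_for(self, dataset_id, marked_for_deletion=False):
        # Datasets pre-marked for deletion never need durable storage,
        # so they can live on the cluster's scratch disk.
        base = self.scratch_dir if marked_for_deletion else self.permanent_dir
        return os.path.join(base, f"dataset_{dataset_id}.dat")
```

The point is only that once "will be deleted after the workflow" is known up front, the placement decision becomes a one-line branch in the store.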

There are a lot of things to work on, and there is a long way to go. I
have created a Trello card for this and will link it to this thread,
but it should probably be spelled out more concretely and broken into
multiple cards:

https://trello.com/c/dUMOHHmM

-John

>
> Peter
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   http://lists.bx.psu.edu/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/