Re: [galaxy-dev] [galaxy-user] operating on, and getting at, large files in galaxy...

17 Feb 2011

Hi Nick,

Yes, these next-gen read files are huge and getting bigger every quarter!
But there will be storage issues no matter whether you use Galaxy or not.
In fact, I think users are more likely to clean up files and histories in
Galaxy than they are to clean up NFS folders -- out of sight, out of mind!

Firstly, I think unnecessary intermediate files are more of a problem than
whether or not the files are compressed.  Indeed, just transferring
these files back and forth from the cluster takes a while, not to mention
the delay in waiting to be rescheduled for each step.  And so I created a
tool which does the job of fastq groomer, end-trimmer, process pairs,
and a few other simple tasks -- all in one shot.  I haven't uploaded it to
the toolshed yet but I will.  I hate to duplicate existing tools, but I have
a lot of seq data.  I will also create a fastqilluminabz2 datatype
and include it with the tool.
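
To illustrate the "all in one shot" idea, here is a minimal sketch of a single
pass that grooms and end-trims each read without writing intermediate files.
It is not the tool described above: the function names (open_text, read_fastq,
groom_and_trim, process), the fixed 3' trim length, and the assumption that
input qualities are Illumina 1.3+ (Phred+64) being converted to Sanger
(Phred+33) are all illustrative choices, and pair processing is left out.

```python
import bz2


def open_text(path, mode="r"):
    """Open a plain or .bz2-compressed FASTQ file as text (hypothetical helper)."""
    if path.endswith(".bz2"):
        return bz2.open(path, mode + "t")
    return open(path, mode)


def read_fastq(handle):
    """Yield (header, seq, qual) from a strict 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line, which this sketch treats as EOF)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def groom_and_trim(record, trim3=0, offset_in=64, offset_out=33):
    """Re-encode qualities (assumed Phred+64 -> Phred+33) and trim the 3' end."""
    header, seq, qual = record
    shift = offset_in - offset_out
    qual = "".join(chr(ord(c) - shift) for c in qual)
    if trim3:
        seq, qual = seq[:-trim3], qual[:-trim3]
    return header, seq, qual


def process(in_handle, out_handle, trim3=0):
    """Stream every record through both steps; return the number of reads written."""
    count = 0
    for record in read_fastq(in_handle):
        header, seq, qual = groom_and_trim(record, trim3)
        out_handle.write(f"{header}\n{seq}\n+\n{qual}\n")
        count += 1
    return count
```

With handles from open_text("reads.fastq.bz2") and open_text("out.fastq.bz2", "w")
(both file names are placeholders), the data stays compressed on disk and each
read is touched only once, which is the point of merging the steps.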

For getting files into Galaxy, I created a simple tool which allows
staff to enter NFS paths and choose whether to copy the file or, if the
location is considered stable, symlink it.  I allowed only certain folders
(e.g. /home, /storage) and added a password, for security.  Similarly, for
getting a file out, all you need is a dinky tool for users to provide a
destination path.  Since I've got Galaxy running as a special galaxy user in
a special galaxy group, file access is restricted (as it should be), so I
tell users to create a dropbox folder in their homedir (and chmod 777).  By
creating a tool like this, you don't need to care how Galaxy names the
files.  I deliberately try to not mess around under the hood.  I can upload
these to the Galaxy toolshed, but like I said, there isn't much to them.
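
For anyone wanting to roll their own, the core of such import and export
tools might look roughly like the sketch below. This is an assumption-laden
illustration, not the actual tools: the names ALLOWED_ROOTS, is_allowed,
import_file and export_file are made up here, the password check is omitted,
and it assumes the tool is handed an output path that may already exist as an
empty file.

```python
import os
import shutil

# Folders staff may import from; /home and /storage are the examples given above.
ALLOWED_ROOTS = ("/home", "/storage")


def is_allowed(path, roots=ALLOWED_ROOTS):
    """True if path, after resolving symlinks, lies inside one of the roots."""
    real = os.path.realpath(path)
    for root in roots:
        root = os.path.realpath(root)
        if os.path.commonpath([real, root]) == root:
            return True
    return False


def import_file(src, dest, link=False, roots=ALLOWED_ROOTS):
    """Copy src to dest, or symlink it when the source location is stable."""
    if not is_allowed(src, roots):
        raise PermissionError(f"{src} is outside the allowed folders")
    if not os.path.isfile(src):
        raise FileNotFoundError(src)
    if link:
        # Assumption: dest may be a pre-created empty output file; replace it.
        if os.path.lexists(dest):
            os.remove(dest)
        os.symlink(os.path.realpath(src), dest)
    else:
        shutil.copyfile(src, dest)
    return dest


def export_file(dataset_path, dest_dir, name):
    """Copy a dataset into a user's dropbox folder under a user-chosen name."""
    if not os.path.isdir(dest_dir):
        raise NotADirectoryError(dest_dir)
    # basename() drops any directory components the user typed into the name.
    target = os.path.join(dest_dir, os.path.basename(name))
    shutil.copyfile(dataset_path, target)
    return target
```

Resolving symlinks before the folder check matters: otherwise a link placed
inside /home could point the tool at a file anywhere on the system.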

Ed

On Wed, Feb 9, 2011 at 4:17 AM, Nick Schurch <N.Schurch@dundee.ac.uk> wrote:
...
Hi all,
I've recently encountered a few problems when trying to use Galaxy which
are really driving me away from using it as a bioinformatics platform for
NGS. I was wondering if there are any simple solutions that I've missed...
Firstly, it seems that while there are a few solutions for getting large
files (a few GB) into a local install of Galaxy without going through HTTP,
many tools that operate on these files produce multiple, uncompressed large
files which quickly eat up the disk allocation. This is particularly
significant in a workflow that has multiple processing steps which each
leave behind a large file. With no way to compress or archive files produced
by intermediate steps in a workflow, and no desire to delete them since I
may need to go back to them and they can take hours to re-run, the only two
remaining options seem to be to save them and then delete them.
And this brings me to the second problem. Getting large files out of
Galaxy. The only way to save large files from Galaxy (that I can see) is the
save icon, which downloads the file via HTTP. This takes *ages* for a large
file and also causes big headaches for my Firefox browser. I've taken a
quick peek at the Galaxy file system to see if I could just copy a file, but
it's almost completely indecipherable if you want to find out what file in
the file system corresponds to a file saved from a tool. Is there some way
to get the location of a particular file on the galaxy file system, that I
can just copy?
--
Cheers,
Nick Schurch
Data Analysis Group (The Barton Group),
School of Life Sciences,
University of Dundee,
Dow St,
Dundee,
DD1 5EH,
Scotland,
UK
Tel: +44 1382 388707
Fax: +44 1382 345 893
_______________________________________________
galaxy-user mailing list
galaxy-user@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-user

Edward Kirton
