Re: [galaxy-dev] [galaxy-user] operating on, and getting at, large files in galaxy...
Nick Schurch wrote:
Hi all,
I've recently encountered a few problems when trying to use Galaxy which are really driving me away from using it as a bioinformatics platform for NGS. I was wondering if there are any simple solutions that I've missed...
Hi Nick, We've had some internal discussion and proposed some solutions which would hopefully make Galaxy more useful for your environment.
Firstly, it seems that while there are a few solutions for getting large files (a few GB) into a local install of galaxy without going through HTTP, many tools that operate on these files produce multiple, uncompressed large files which quickly eat up the disk allocation. This is particularly significant in a workflow with multiple processing steps, each of which leaves behind a large file. With no way to compress or archive files produced by intermediate steps in a workflow, and no desire to delete them since I may need to go back to them and they can take hours to re-run, the only remaining option seems to be to save them outside Galaxy and then delete them from the history.
We've dealt with this locally by implementing compression in the underlying filesystem (ZFS), but this requires a fileserver running Solaris (or a derivative) or FreeBSD. Btrfs also supports compression, but I would be a bit more wary of losing data with btrfs since it is less mature and lacks tools to recover a corrupted filesystem. Fusecompress would also be an option. We would strongly recommend performing regular backups regardless of which filesystem you choose.

Unfortunately this is a tricky problem to solve within Galaxy itself. While some tools can operate on compressed files directly, many cannot, so compressing all outputs could prove very CPU intensive and a waste of time if the next step has to decompress the file again. There has been some discussion of how to implement transparent compression and other more complex underlying data management directly in Galaxy, but that work is not likely to commence soon.
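For illustration, enabling compression on an existing ZFS filesystem is a one-line change. This is only a sketch: the dataset name (tank/galaxy-files) and the choice of gzip are placeholders to substitute for your own setup.

========
# assumes Galaxy's database/files directory lives on a ZFS dataset
# named tank/galaxy-files -- substitute your own pool/dataset name
zfs set compression=gzip tank/galaxy-files    # or compression=lzjb for lower CPU cost
zfs get compression,compressratio tank/galaxy-files
========

Note that only data written after the property is set gets compressed; files that already exist stay uncompressed until they are rewritten.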
And this brings me to the second problem: getting large files out of Galaxy. The only way to save large files from Galaxy (that I can see) is the save icon, which downloads the file via HTTP. This takes *ages* for a large file and also causes big headaches for my firefox browser. I've taken a quick peek at the Galaxy file system to see if I could just copy a file, but it's almost completely indecipherable if you want to find out which file in the file system corresponds to a file saved from a tool. Is there some way to get the location of a particular file on the galaxy file system, so that I can just copy it?
This is certainly something we can implement and will be working on fairly soon. There have been quite a few requests to integrate more tightly with environments where Galaxy users exist as system users. There's an issue in our tracker which you can follow here: https://bitbucket.org/galaxy/galaxy-central/issue/106/ --nate
-- Cheers,
Nick Schurch
Data Analysis Group (The Barton Group), School of Life Sciences, University of Dundee, Dow St, Dundee, DD1 5EH, Scotland, UK
Tel: +44 1382 388707 Fax: +44 1382 345 893
Hi Nick,

If you're running your own local instance, then nothing is impossible - it's just a bit ugly...

On 02/21/2011 12:36 PM, Nate Coraor wrote:
[...] many tools that operate on these files produce multiple, uncompressed large files which quickly eat up the disk allocation. [...] With no way to compress or archive files produced by intermediate steps in a workflow, [...]
Here's a tool that compresses an input galaxy dataset and then deletes the input file. Deleting the input dataset from under galaxy's feet obviously goes against everything galaxy stands for, and I'm sure the Galaxy team does not endorse such solutions. It will also make your database slightly out of sync with the real files on disk. But hey - desperate times call for desperate measures :)

========
<tool id="cshl_compress_input" name="Compress Input File">
  <description>for advanced users only!</description>
  <command>gzip -c '$input' > '$output' &amp;&amp; rm '$input'</command>
  <inputs>
    <param format="data" name="input" type="data" label="Dataset to Compress" />
    <param format="data" name="waitforinput" type="data" label="Tool to wait for" />
  </inputs>
  <outputs>
    <data format="gzip" name="output" />
  </outputs>
  <help>
**What it does**

DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>
========

The second input parameter ("waitforinput") is there only to force this tool to run after another tool (the one that still needs the uncompressed input file) - connect it carefully in your workflow. Making the output format "gzip" ensures the new compressed files can't be used with any regular tool. Then create a similar "uncompress" tool that does the opposite.
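For completeness, a companion "uncompress" tool might look roughly like the following. This is only a sketch in the same spirit - the tool id and labels are made up, the "gzip" input format just mirrors the output format of the compress tool above, and you would adjust the "data" output format to the datatype you originally compressed:

========
<tool id="cshl_uncompress_input" name="Uncompress Input File">
  <description>for advanced users only!</description>
  <!-- hypothetical companion to the compress tool above: inflates the
       gzipped dataset and then removes the compressed copy, with all
       the same caveats about going behind galaxy's back -->
  <command>cat '$input' | gunzip > '$output' &amp;&amp; rm '$input'</command>
  <inputs>
    <param format="gzip" name="input" type="data" label="Dataset to Uncompress" />
  </inputs>
  <outputs>
    <data format="data" name="output" />
  </outputs>
  <help>
**What it does**

DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>
========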
And this brings me to the second problem: getting large files out of Galaxy. The only way to save large files from Galaxy (that I can see) is the save icon, which downloads the file via HTTP. This takes *ages* for a large file and also causes big headaches for my firefox browser.
Here are three solutions (in varying levels of ugliness) to get files out of galaxy:

1. This simple tool will tell you the full path of your dataset:

=========
<tool id="cshl_get_dataset_full_path" name="Get dataset full path">
  <description>for advanced users only!</description>
  <command>readlink -f '$input' > '$output'</command>
  <inputs>
    <param format="data" name="input" type="data" label="Show full path of dataset" />
  </inputs>
  <outputs>
    <data format="txt" name="output" />
  </outputs>
  <help>
**What it does**

DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>
=========

Run it on any input dataset and the output will contain the dataset's full path on your local filesystem. It goes without saying that this is a security hazard: only use this tool if you know what you're doing and you trust your users. Once you have the full path, just copy the file directly, outside of galaxy.

2. The following tool allows the user to export a dataset into a hard-coded directory (/tmp/galaxy_export in this example). This is just a proof of concept; for a production environment you'll need to add validators to the "description" parameter to prevent users from adding unwanted characters. But it works - once the tool is run, the selected dataset will appear under /tmp/galaxy_export/<the user's email address>/ .

=========
<tool id="cshl_export_to_local" name="Export to local file">
  <description>for advanced users only!</description>
  <command>
    mkdir -p /tmp/galaxy_export/$userEmail &amp;&amp;
    ln -s '$input' '/tmp/galaxy_export/$userEmail/${input.hid}_${description}.${input.extension}'
  </command>
  <inputs>
    <param format="data" name="input" type="data" label="Dataset to Export" />
    <param name="description" type="text" size="30" label="File name" />
  </inputs>
  <outputs>
    <data format="txt" name="output" />
  </outputs>
  <help>
**What it does**

DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>
=========

3. Last but not least, if you have access to the database, getting the dataset path is easy if you know the dataset number or the dataset hash-id (and you have them as links on the galaxy web page). This solution is not for the faint of heart, but if you want I can show examples of how to get from one to the other.
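As a rough illustration of option 3 (a sketch only, assuming the stock Galaxy schema and the default on-disk layout - the history id 123 and item number 42 below are made-up values):

=========
-- sketch only: stock Galaxy tables, placeholder ids
-- map a history item number ("hid", the number shown in the history panel)
-- to the id of the underlying dataset
SELECT hda.hid, hda.name, d.id AS dataset_id
FROM history_dataset_association hda
JOIN dataset d ON hda.dataset_id = d.id
WHERE hda.history_id = 123 AND hda.hid = 42;

-- with the default disk layout the file then lives at roughly
--   <file_path>/000/dataset_<dataset_id>.dat
-- where file_path is the directory configured in universe_wsgi.ini
-- (database/files by default) and "000" is the dataset id divided
-- by 1000, zero-padded to three digits.
=========

-gordon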