As a note on this, it would be important not to include "hidden" datasets in the output, or at least make it an option to leave them out. With regard to the large FASTQ (and SAM) files, there were murmurs recently about an option to turn off metadata for files over a particular size, or something similar. Any progress on that?

Cheers
Dennis

On Wed, Jun 30, 2010 at 7:56 AM, Assaf Gordon <gordon@cshl.edu> wrote:
Kudos for this nice feature. It would be very useful, both for backups and for exchanging histories between Galaxies.

But as usual ;) I have a couple of comments:

1. Exporting big files doesn't work.
You first create the tar file, then add the individual datasets, and only then stream the finished tarball to the user.
That's very bad. Try exporting a history with 18M-paired-end-76nt-read FASTQ files (and remember that they are duplicated, because you guys still insist on grooming).
The proxy (Apache, at least) will surely time out.

BTW, since you're using Python's native tarfile module, the Python process will keep running at 100% CPU long after the client connection has timed out. 20 users exporting large histories at the same time would constitute a DDoS attack ;)
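For what it's worth, here is a rough sketch (not the actual Galaxy code; the dataset attribute names are guesses) of why this pattern hurts: nothing is sent to the client until every dataset has already been copied into the archive.

import tarfile, tempfile

def export_history(datasets):
    # The whole archive is written out inside the request handler, so the
    # proxy sees zero bytes until every multi-GB dataset has been copied.
    tmp = tempfile.NamedTemporaryFile(suffix=".tar.gz", delete=False)
    with tarfile.open(fileobj=tmp, mode="w:gz") as archive:
        for ds in datasets:                         # each one may be several GB
            archive.add(ds.file_name, arcname=ds.display_name)
    tmp.close()
    return open(tmp.name, "rb")                     # streaming only starts here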

2. Importing large files (once you've fixed exporting them...) will cause the same problems as uploading huge FASTQ files (and you're seeing more and more complaints about those).


Combining 1+2, I would suggest the following:
1. Make the export a new "job" (see the sketch after these two points). This would allow it to run for a long time, in a separate process, without blocking Python and without tying the user to the browser (also remember: downloading 3 GB with your browser is no fun).
Since your code already stores "temporary" tarballs in a dedicated "export" directory, just keep them there for an arbitrary amount of time (let's say 4 days),
and provide the user with a link to download the history at his/her own convenience.

2. Allow importing a history from a URL.
This would also have the benefit of not funneling huge tarballs through the user's personal computer when they want to move histories between Galaxies.
They would go to one Galaxy, run the "export" job, wait until it's done and grab the download link, then go to the other Galaxy and simply import the history with that link.
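A minimal sketch of what suggestion 1 could look like (the export directory, token scheme, and URL layout here are all made up for illustration):

import os, tarfile, time, uuid

EXPORT_DIR = "database/export"            # assumed location for the "temporary" tarballs
EXPORT_TTL = 4 * 24 * 3600                # keep them around for ~4 days

def run_export_job(datasets):
    # Runs as a regular Galaxy job, in its own process, so the web
    # frontend is never blocked and the proxy never times out.
    token = uuid.uuid4().hex
    path = os.path.join(EXPORT_DIR, token + ".tar.gz")
    with tarfile.open(path, "w:gz") as archive:
        for ds in datasets:
            archive.add(ds.file_name, arcname=ds.display_name)
    # The user gets this link once the job finishes and can fetch it later,
    # or paste it into another Galaxy's "import from URL" box.
    return "/history/export_archive?token=" + token

def cleanup_exports():
    # Periodic cleanup: drop tarballs older than the retention window.
    now = time.time()
    for name in os.listdir(EXPORT_DIR):
        full = os.path.join(EXPORT_DIR, name)
        if now - os.path.getmtime(full) > EXPORT_TTL:
            os.remove(full)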


And a bonus:
You've checked the tarball for malicious files, but not the JSON attributes.
If I import a history with a dataset that has:
 [ { "file_name": "../../../etc/passwd" } ]

I can get any file on the system into Galaxy (this happens in <galaxy>/lib/galaxy/web/controllers/history.py:580).
If I import "/dev/urandom" it'll be even more fun.
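To make the problem concrete, here's what the naive path construction presumably does with that attribute (directory names invented):

import os

import_dir = "/tmp/galaxy_import/abc123"            # hypothetical temp import dir
attrs = [{"file_name": "../../../etc/passwd"}]      # attacker-controlled JSON

for ds in attrs:
    path = os.path.join(import_dir, ds["file_name"])
    print(os.path.normpath(path))                    # -> /etc/passwd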

I would recommend against just checking for ".." or "/" (as in line 457), because a determined person could possibly circumvent that with tricks like ".\u002E/etc/passwd" and the like.
Instead, construct the full path (as set in the dataset attributes), then check with os.path.abspath() that it resides in the temporary import directory.
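Something along these lines (a sketch only; the function name is made up, and realpath() rather than abspath() also covers symlink tricks):

import os

def safe_import_path(import_dir, file_name):
    # Resolve the dataset's file_name relative to the import directory and
    # refuse anything that ends up outside of it ("../" sequences, absolute
    # paths like /dev/urandom, symlinks, ...).
    base = os.path.realpath(import_dir)
    candidate = os.path.realpath(os.path.join(base, file_name))
    if not candidate.startswith(base + os.sep):
        raise ValueError("dataset path escapes the import directory: %r" % file_name)
    return candidate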


Comments are welcome,
       -gordon
_______________________________________________
galaxy-dev mailing list
galaxy-dev@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-dev