Re: [galaxy-user] Handling large files in Galaxy

14 Jul 2009

Hello Brad,

Galaxy already will handle large files in the way that you describe if 
you upload the file(s) to a library, creating what we refer to as a 
library dataset.

With library datasets, there is 1 file on disk, even if users "import 
the library dataset" into their history for their own analysis.  When 
users do this, it may look like they have created their own copy of the 
file on disk, but they are really just working with a pointer to the 
single disk file.
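
As a rough illustration of the pointer idea only (this is not Galaxy's 
actual data model; the class names, function name, and file path below 
are all hypothetical), importing a library dataset into a history can be 
thought of as creating a new reference to the same file record:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LibraryDataset:
    """Hypothetical stand-in for a library dataset: one file on disk."""
    name: str
    file_path: str


@dataclass
class HistoryItem:
    """Hypothetical history entry that points at a library dataset."""
    owner: str
    dataset: LibraryDataset


def import_into_history(owner: str, dataset: LibraryDataset) -> HistoryItem:
    # No bytes are copied; the history item only stores a reference.
    return HistoryItem(owner=owner, dataset=dataset)


# Hypothetical path, for illustration only.
alignments = LibraryDataset("alignments", "/data/library/alignments.dat")
alice_item = import_into_history("alice", alignments)
bob_item = import_into_history("bob", alignments)

# Both history items refer to the same single file on disk.
print(alice_item.dataset.file_path == bob_item.dataset.file_path)
```

Each user sees their own history item, but every item resolves to the 
same path, so a very large file is stored only once.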

If you do not associate any roles with the "access" permission on 
the library dataset, it is considered public, and anyone can access it. 
However, if you associate roles with the access permission on the 
dataset, a user must have every role associated with the access 
permission in order to access the dataset in the library.  Galaxy 
performs checks to ensure that the roles associated with the access 
permission on library datasets do not result in the dataset becoming 
inaccessible to all users.
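
The access rule above can be sketched in a few lines of Python. This is 
only a minimal sketch of the rule as described, not Galaxy's own 
implementation; the function names and the role names in the example are 
made up for illustration:

```python
def can_access(user_roles, access_roles):
    """Return True if a user holding user_roles may access the dataset.

    access_roles are the roles associated with the dataset's "access"
    permission. No roles means the dataset is public; otherwise the
    user must hold every one of them.
    """
    access_roles = set(access_roles)
    if not access_roles:
        return True
    return access_roles.issubset(set(user_roles))


def someone_can_access(all_users_roles, access_roles):
    """Sketch of the sanity check: at least one user must still qualify.

    all_users_roles is an iterable of role collections, one per user.
    """
    return any(can_access(roles, access_roles) for roles in all_users_roles)


# Hypothetical roles for illustration.
print(can_access({"sequencing_lab"}, set()))
print(can_access({"sequencing_lab"}, {"sequencing_lab", "project_x"}))
print(can_access({"sequencing_lab", "project_x"}, {"sequencing_lab", "project_x"}))
```

Under this rule, adding a role to the access permission can only narrow 
the set of users who qualify, which is why a combination of roles that no 
single user holds would lock everyone out.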

If you have questions about integrating your proprietary C++ tool, 
please refer to our wiki at 
http://g2.trac.bx.psu.edu/wiki/AddingTools.

Please don't hesitate to contact us / me with any additional questions 
as you work through this process, and we'll make sure you get all of the 
help you need for this work.

Thanks very much,

Greg Von Kuster
Galaxy Development Team

Brad Chapman wrote:
Hi all;
I've recently gotten a local Galaxy install up and running for our
group. We do a lot of short read sequencing analysis and are looking
at Galaxy as a framework to present the data and custom analyses
associated with it. One of our main interests is scaling the
presentation to large fastq and alignment files.
Specifically, we have a case where we'd like to make a large ~300Gb
alignment file available to users to query and retrieve sections of
alignments corresponding to genomic coordinates. We have a custom
C++ program that does this, and would like to plug it in through the
tools interface. We'd ideally like to use the Library permissions
interface to make this available to certain users.
Would anyone be able to offer some advice about the best way to
handle this? The standard upload, history, analyze would not be
ideal since this large file would be copied around. We've
brainstormed 3 different ways to approach this:
- Have "special" uploaded files which are actually symlinks to the
  original file and do not get copied. This looks relatively
  difficult on my initial assessment.
- Pass the logged in user to the C++ program and embed the logic of
  finding the right file within the external tool. Here we would
  need some advice about whether it is possible to pass the current
  user through the tools interface.
- The hack solution: upload a file that is actually just a
  link reference to the desired file, and this file gets passed
  to the external tool. The tool then can read the tiny file, know what
  large file to access, and proceed from there. This would involve
  some new datatype integration to handle the hack.
I am still relatively uninitiated in the Galaxy way, so could use
some advice on whether any of these solutions are more likely to work
smoothly than others. Generally, what sort of approach is Galaxy
taking towards increasingly massive files? Is anyone else
doing something similar?
Thanks for any thoughts,
Brad