Best way to work with one directory and many files as 1 input
Hi all,

We've just added some new tools based on R scripts to our local Galaxy instance. Most of these tools need to work at the root of the directory containing the input files (up to hundreds of XML files), which are spread among two or more sub-directories. The directory structure needs to be kept, since the R tools recursively search for files and use the sub-directory names as classes.

To solve this problem we added a "dummy" datatype to our instance so we can upload the input directory as a zip file without Galaxy decompressing it:

    <datatype extension="dummy_zip" type="galaxy.datatypes.data:Data" mimetype="application/zip" display_in_upload="true" subclass="true" />

However, since our tools can be run as a workflow and most of them need this input directory, we have to unzip it with R in the job working directory for each tool (about 5 times for the entire workflow). Furthermore, this solution doesn't seem very "clean" if we want to share our tools via the ToolShed.

Is there a smart way to handle this kind of input directory that can be achieved with Galaxy's default datatypes and/or that doesn't require unzipping a file each time we use a tool?

Is there any update on a behavior change for zip files (http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-June/005631.html)?

Thanks in advance for any input,

Pierre

--
Pierre Pericard IE CDD - Projet Peptisan
Service Informatique et Bio-informatique (SIB) Station Biologique de Roscoff CNRS-UPMC Place Georges Teissier CS 90074 29688 Roscoff CEDEX FRANCE http://abims.sb-roscoff.fr/
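For readers following along, the per-tool unzip step described in this message could look roughly like the sketch below. It is written in Python rather than the R used by the original tools, and the function name `unpack_input_zip` and its arguments are hypothetical. It only illustrates extracting the archive into a job working directory while keeping the sub-directory layout intact.

```python
import os
import zipfile


def unpack_input_zip(zip_path, work_dir):
    """Extract zip_path under work_dir, keeping its sub-directory layout.

    Hypothetical helper: zip_path is the uploaded "dummy_zip" dataset and
    work_dir is the job working directory the tool runs in.
    """
    root = os.path.realpath(work_dir)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            target = os.path.realpath(os.path.join(root, name))
            # Fail loudly on members that would resolve outside work_dir.
            if os.path.commonpath([root, target]) != root:
                raise ValueError("unsafe path in archive: %s" % name)
        archive.extractall(root)
    return root
```

Every tool in the workflow would have to repeat this call, which is the cost the question is trying to avoid.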
On Tue, Jan 29, 2013 at 4:41 PM, Pierre Pericard <pierre.pericard@sb-roscoff.fr> wrote:
> Hi all,
> We've just added some new tools based on R scripts to our local Galaxy instance.
> Most of these tools need to work at the root of the directory containing the input files (up to hundreds of XML files), which are spread among two or more sub-directories. The directory structure needs to be kept, since the R tools recursively search for files and use the sub-directory names as classes.
> To solve this problem we added a "dummy" datatype to our instance so we can upload the input directory as a zip file without Galaxy decompressing it.

Have you looked at a composite datatype instead, where the files are stored on disk decompressed?

http://wiki.galaxyproject.org/Admin/Datatypes/Composite%20Datatypes

Peter
If I'm not mistaken, composite datatypes allow for only one directory, whereas we need to keep a constant directory structure with 2 or more sub-directories containing our input files. We have no way to change the behavior of these tools (obviously not Galaxy-friendly ;-) ) and therefore need to maintain this structure in the job working directory.

Pierre

On 29/01/2013 17:47, Peter Cock wrote:
> Have you looked at a composite datatype instead, where the files are stored on disk decompressed?
> http://wiki.galaxyproject.org/Admin/Datatypes/Composite%20Datatypes
On Tue, Jan 29, 2013 at 4:58 PM, Pierre Pericard <pierre.pericard@sb-roscoff.fr> wrote:
> If I'm not mistaken, composite datatypes allow for only one directory, whereas we need to keep a constant directory structure with 2 or more sub-directories containing our input files.

I'm not sure if that is true: the example of HTML output with images comes to mind as a common use case where subfolder(s) would be expected. I've only had limited first-hand experience with Galaxy's composite datatypes myself, though.

> We have no way to change the behavior of these tools (obviously not Galaxy-friendly ;-) ) and therefore need to maintain this structure in the job working directory.

Perhaps a tool wrapper could create a dummy folder using symlinks (faster and less wasted disk space than copying files), but that isn't ideal.

Peter
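A minimal sketch of the symlink idea, assuming the wrapper is written in Python and that the decompressed input tree is already available somewhere on disk. The function name `mirror_with_symlinks` and both arguments are hypothetical. It recreates the directory tree in the job working directory and links each file instead of copying it.

```python
import os


def mirror_with_symlinks(source_root, dest_root):
    """Recreate source_root's directory tree under dest_root, symlinking each file.

    Hypothetical helper: source_root is the stored input directory,
    dest_root is the "dummy" folder inside the job working directory.
    """
    source_root = os.path.abspath(source_root)
    linked = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel_dir = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(dest_root, rel_dir))
        # Real directories are created so the tools can recurse into them.
        os.makedirs(target_dir, exist_ok=True)
        for filename in filenames:
            link_path = os.path.join(target_dir, filename)
            # Raises FileExistsError if dest_root already holds this name.
            os.symlink(os.path.join(dirpath, filename), link_path)
            linked.append(link_path)
    return sorted(linked)
```

Only the files are links; the sub-directories are real, so tools that use sub-directory names as classes see the same layout as the original input.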
In that case, could anyone point me to an example of a composite datatype that accepts as input an unknown number of files in an unknown number of directories? I can't work out from the wiki how that would work.

But maybe we are anticipating an upcoming Galaxy feature. There was talk about changing the way Galaxy handles zip files; is that still on the table?

Thanks in advance for any help,

Pierre

On 29/01/2013 18:04, Peter Cock wrote:
> I'm not sure if that is true: the example of HTML output with images comes to mind as a common use case where subfolder(s) would be expected. [...]
I'd suggest:

1) Make your new datatype a subclass of Html. It's a subclass of composite that contains an HTML document as the object's native display, so it can inform users what's there.
2) When constructing these new things, pass the file_path of the Html (composite) dataset subclass to your wrapper on the command line.
3) Your wrapper code can construct any arbitrary structure as long as it's rooted in that directory; Galaxy stores it without any fuss. The wrapper should also populate the Html file itself with nicely laid-out annotation for the user to check out.
4) The key is that all tools that take this new datatype as input must know how to decode this structure. They must be passed $input.extra_files_path, which gives them that same path root.
5) Yes, it's odd and annoying that it's extra_files_path for files_path. Go figure.
6) Run "grep extra_files tools/*.xml" to find some examples. I think the velvetg one uses a complex subdirectory structure, but it doesn't really matter: as long as your tools know how to deal with it, it's just a directory to Galaxy!

I hope all this helps...

On Wed, Jan 30, 2013 at 8:22 PM, Pierre Pericard <pierre.pericard@sb-roscoff.fr> wrote:
> In that case, could anyone point me to an example of a Composite Datatype which could accept as input an unknown number of files in an unknown number of directories. [...]
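The wrapper side of the numbered suggestions about an Html-based composite datatype could be sketched as follows. This is an illustration under assumptions, not Galaxy's own code: the function names are hypothetical, and the three paths are assumed to arrive on the wrapper's command line from the tool XML (the Html dataset file and its files directory for the producing tool, `$input.extra_files_path` for consuming tools).

```python
import html
import os
import shutil


def build_composite(source_root, html_path, files_path):
    """Copy source_root into files_path and write an HTML index to html_path.

    Hypothetical producer: files_path is the directory Galaxy keeps alongside
    the Html dataset; any structure rooted there is stored as-is.
    """
    os.makedirs(files_path, exist_ok=True)
    entries = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel_dir = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(files_path, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for filename in filenames:
            shutil.copy2(os.path.join(dirpath, filename), target_dir)
            rel = os.path.normpath(os.path.join(rel_dir, filename))
            entries.append(rel.replace(os.sep, "/"))
    entries.sort()
    # The Html file is what users see; list every stored file in it.
    items = "\n".join(
        '<li><a href="%s">%s</a></li>' % (html.escape(e, quote=True), html.escape(e))
        for e in entries
    )
    with open(html_path, "w") as handle:
        handle.write("<html><body><ul>\n%s\n</ul></body></html>\n" % items)
    return entries


def classes_from_extra_files(extra_files_path):
    """Hypothetical consumer: top-level sub-directory names, used as classes."""
    return sorted(
        name for name in os.listdir(extra_files_path)
        if os.path.isdir(os.path.join(extra_files_path, name))
    )
```

A downstream tool would then run directly against the stored directory, so nothing needs to be unzipped again between workflow steps.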
OK, thanks a lot. I'll try this and get back to the mailing list if other problems come up.

Pierre

On 30/01/2013 11:45, Ross wrote:
> I'd suggest: [...]
participants (3): Peter Cock, Pierre Pericard, Ross