Hi,

I've been working off of a local instance of Galaxy (6799:40f1816d6857) to develop tools, and I'm running into some issues that I'm not sure how to resolve idiomatically.

First, some of the tools I'm trying to write Galaxy wrappers for take a directory as input. The directory contains some metadata and some related files (a FASTA alignment, a Newick tree, etc.). Originally I tried making a datatype that was a .tar.gz'd directory, plus a converter that would untar it to a directory by doing approximately `rm $output; mkdir $output; tar xf $input -C $output`. This was causing errors in Galaxy; I can dig up the exact errors if that would be useful, but I figured this is probably an unsupported approach anyway. The way I'm handling it now is to untar the archive every time the tool is run, but it seems like there's probably a better way to set this up.

Second, these tools also expect their inputs to have a meaningful file extension, but the filenames that Galaxy gives to the tool all have ".dat" as the extension. I was getting around this by doing `ln -s $input input.ext`, but again, this seems suboptimal. It's also complicated by the fact that some of these programs can take either a FASTA or a Stockholm alignment, but decide which parser to use based on the extension. I'd like to support both, but it seems that would require getting the file type out of the dataset's metadata and checking against it.

I've seen some mailing list posts that describe the same symlink trick, but I'm hoping there's a better method available now.

Thanks for any insight into writing idiomatic Galaxy tools,
~Aaron
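P.S. For concreteness, the wrapper's command template currently does roughly the following (the tool name and the tarball parameter are simplified placeholders, not the real wrapper):

    # unpack the reference package tarball into the working directory
    mkdir refpkg && tar xzf $refpkg_tarball -C refpkg
    # give the alignment a meaningful extension before handing it over
    ln -s $input input.fasta
    # run the actual tool against the unpacked directory
    the_tool --refpkg refpkg input.fasta > $output

Hard-coding `input.fasta` is exactly the problem: the same wrapper should be able to link the input as `input.sto` when the dataset is a Stockholm alignment instead.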
I'm no authority... but given that nobody else has replied yet, I'll give my opinion.

I think I would approach the directory problem with a wrapper script that takes arguments for each of the components needed by the tool. The script could lay out the various files as expected in the working directory and call the script. I think that's cleaner than expecting users to build a tar archive with the proper structure.

Brad

On Apr 3, 2012, at 8:58 PM, Aaron Gallagher wrote:
Hi,
I've been working off of a local instance of Galaxy (6799:40f1816d6857) to develop tools, and I'm running into some issues that I'm not sure how to resolve idiomatically.

First, some of the tools I'm trying to write Galaxy wrappers for take a directory as input. The directory contains some metadata and some related files (a FASTA alignment, a Newick tree, etc.). Originally I tried making a datatype that was a .tar.gz'd directory, plus a converter that would untar it to a directory by doing approximately `rm $output; mkdir $output; tar xf $input -C $output`. This was causing errors in Galaxy; I can dig up the exact errors if that would be useful, but I figured this is probably an unsupported approach anyway.

The way I'm handling it now is to untar the archive every time the tool is run, but it seems like there's probably a better way to set this up.

Second, these tools also expect their inputs to have a meaningful file extension, but the filenames that Galaxy gives to the tool all have ".dat" as the extension. I was getting around this by doing `ln -s $input input.ext`, but again, this seems suboptimal. It's also complicated by the fact that some of these programs can take either a FASTA or a Stockholm alignment, but decide which parser to use based on the extension. I'd like to support both, but it seems that would require getting the file type out of the dataset's metadata and checking against it.

I've seen some mailing list posts that describe the same symlink trick, but I'm hoping there's a better method available now.

Thanks for any insight into writing idiomatic Galaxy tools,
~Aaron
--
Brad Langhorst
langhorst@neb.com
978-380-7564
On Apr 4, 2012, at 11:11 AM, Langhorst, Brad wrote:
I'm no authority... but given that nobody else has replied yet I'll give my opinion.
I think I would approach the directory problem with a wrapper script that takes arguments for each of the components needed by the tool. The script could lay out the various files as expected in the working directory and call the script. I think that's cleaner than expecting users to build a tar archive with the proper structure.
Sorry - that should read "call the tool", not "call the script" (stack too deep...).

Brad
On Apr 4, 2012, at 8:11 AM, Langhorst, Brad wrote:
I think I would approach the directory problem with a wrapper script that takes arguments for each of the components needed by the tool. The script could lay out the various files as expected in the working directory and call the script. I think that's cleaner than expecting users to build a tar archive with the proper structure.
Sorry if I wasn't more clear: the tools take _the entire directory_ (which we call reference packages, to be less ambiguous in the rest of this e-mail) as the input, not parts of it passed separately. Building these reference packages is not a problem. They're a fundamental part of a lot of analyses we do, and as such, we have tools to build them easily. For the Galaxy instance I'm trying to set up, though, most of the reference packages that users need will be provided as shared data.

~Aaron
On Apr 4, 2012, at 1:48 PM, Aaron Gallagher wrote:
On Apr 4, 2012, at 8:11 AM, Langhorst, Brad wrote:
I think I would approach the directory problem with a wrapper script that takes arguments for each of the components needed by the tool. The script could lay out the various files as expected in the working directory and call the script. I think that's cleaner than expecting users to build a tar archive with the proper structure.
Sorry if I wasn't more clear: the tools take _the entire directory_ (which we call reference packages, to be less ambiguous in the rest of this e-mail) as the input, not parts of it passed separately. Building these reference packages is not a problem. They're a fundamental part of a lot of analyses we do, and as such, we have tools to build them easily. For the Galaxy instance I'm trying to set up, though, most of the reference packages that users need will be provided as shared data.
If the number of files in a reference package is small, then it's not unreasonable to ask users to specify each one in addition to their input. I have workflows where users specify a list of reads, a GTF file, a BED file, and an interval file. It's not terribly onerous, because Galaxy will automatically choose an appropriate file from the history when starting the workflow. We import "sets" of these reference-type files by checking them off in the shared data area and importing them to histories en masse.

I guess you could consider a tarball of a directory to be a distinct file type and proceed along your original path - but I think you're right that something seems fishy about that.

Maybe a more specific example, including all the files in the directory, the input data, and the specific tool that will do the analysis, would make this clearer.

Best wishes,

Brad
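P.S. As a concrete sketch, the inputs section of a wrapper written this way might look something like the following - all of the parameter names and formats here are invented for illustration:

    <inputs>
        <param name="ref_aln"  type="data" format="fasta"   label="Reference alignment"/>
        <param name="ref_tree" type="data" format="txt"     label="Reference tree (Newick)"/>
        <param name="ref_info" type="data" format="tabular" label="Reference metadata"/>
        <param name="query"    type="data" format="fasta"   label="Query sequences"/>
    </inputs>

Because each parameter declares a format, Galaxy narrows the history items offered for each one, which is what makes this less onerous than it sounds.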
On Apr 4, 2012, at 11:03 AM, Langhorst, Brad wrote:
Maybe a more specific example, including all the files in the directory, the input data, and the specific tool that will do the analysis, would make this clearer.
One particular tool runs pplacer (https://github.com/matsen/pplacer) on an input alignment and a reference package. The invocation, taken from our public example (https://github.com/fhcrc/microbiome-demo), resembles:
pplacer -c vaginal_16s.refpkg src/p4z1r36.fasta
The reference package is laid out like so:
$ ls -l vaginal_16s.refpkg
total 7936
-rwxr-xr-x  1 habnabit  staff      987 Feb  8 16:46 CONTENTS.json
-rwxr-xr-x  1 habnabit  staff     3911 Feb  8 16:46 RAxML_info.bv_refs_aln
-rwxr-xr-x  1 habnabit  staff    37984 Feb  8 16:46 RAxML_result.bv_refs_aln
-rwxr-xr-x  1 habnabit  staff   514875 Feb  8 16:46 bacteria16S_508_mod5.cm
-rwxr-xr-x  1 habnabit  staff    79661 Feb  8 16:46 bv_refdata.csv
-rwxr-xr-x  1 habnabit  staff  1912382 Feb  8 16:46 bv_refs.sto
-rwxr-xr-x  1 habnabit  staff  1450284 Feb  8 16:46 bv_refs_aln.fasta
-rwxr-xr-x  1 habnabit  staff      397 Feb 13 15:33 phylo_model.json
-rwxr-xr-x  1 habnabit  staff    41259 Feb  8 16:46 tax_table.csv
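Abridged, the CONTENTS.json at the top of that listing maps logical roles to the files around it, along these lines (the exact keys are from memory and may differ; checksums and package metadata omitted):

    {
      "files": {
        "aln_fasta":   "bv_refs_aln.fasta",
        "aln_sto":     "bv_refs.sto",
        "phylo_model": "phylo_model.json",
        "profile":     "bacteria16S_508_mod5.cm",
        "seq_info":    "bv_refdata.csv",
        "taxonomy":    "tax_table.csv",
        "tree":        "RAxML_result.bv_refs_aln",
        "tree_stats":  "RAxML_info.bv_refs_aln"
      }
    }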
Tools, including pplacer, read that CONTENTS.json manifest to locate the other files contained in the directory. In most tools, there's no way of specifying these files other than passing the entire reference package directory. We'd never previously had any issues with passing directories around.

~Aaron
Hi Aaron,

If you are designing a new datatype for a (set of) tool(s), and this datatype requires a bunch of files to be in a directory, and these files are generally only useful when they are bundled together as a single unit (i.e. you wouldn't normally want any one of the files to exist as a separate history item), then you may want to look at composite datasets: http://wiki.g2.bx.psu.edu/Admin/Datatypes/Composite%20Datatypes

That works well for user-supplied data. If you just want to provide reference data for users, a look into the tool_data_tables would be a good start.

Thanks for using Galaxy,

Dan

On Apr 4, 2012, at 1:48 PM, Aaron Gallagher wrote:
On Apr 4, 2012, at 8:11 AM, Langhorst, Brad wrote:
I think I would approach the directory problem with a wrapper script that takes arguments for each of the components needed by the tool. The script could lay out the various files as expected in the working directory and call the script. I think that's cleaner than expecting users to build a tar archive with the proper structure.
Sorry if I wasn't more clear: the tools take _the entire directory_ (which we call reference packages, to be less ambiguous in the rest of this e-mail) as the input, not parts of it passed separately. Building these reference packages is not a problem. They're a fundamental part of a lot of analyses we do, and as such, we have tools to build them easily. For the Galaxy instance I'm trying to set up, though, most of the reference packages that users need will be provided as shared data.
~Aaron
On Apr 5, 2012, at 8:45 AM, Daniel Blankenberg wrote:
If you are designing a new datatype for a (set of) tool(s) and this datatype requires a bunch of files to be in a directory and these files are generally only useful when they are bundled together as a single unit (i.e. you wouldn't normally want any one of the files to exist as a separate history item), then you may want to look at composite datasets: http://wiki.g2.bx.psu.edu/Admin/Datatypes/Composite%20Datatypes
I was looking at composite datatypes, but it seems that they require knowing the filenames in advance? While the filenames within a reference package could be normalized, in practice they typically aren't. I'll take a second look at this, though.
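From my first read, the datatype would have to declare every component under a fixed name up front, along these lines (a rough, untested sketch):

    from galaxy.datatypes.data import Data

    class RefPkg(Data):
        """A reference package bundled as a composite dataset (sketch)."""
        composite_type = 'basic'
        file_ext = 'refpkg'

        def __init__(self, **kwd):
            Data.__init__(self, **kwd)
            # Each component is registered under a fixed filename here,
            # which is where the fixed-name requirement seems to come from.
            self.add_composite_file('CONTENTS.json')
            self.add_composite_file('phylo_model.json', optional=True)

Since our packages record their (arbitrary) filenames in CONTENTS.json, this would apparently mean normalizing the names whenever a package is imported.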
That works well for user-supplied data. If you just want to provide reference data for users, a look into the tool_data_tables would be a good start.
There will be both user-supplied and pre-supplied data. I'll look at this as well.

Thanks,
~Aaron
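P.S. For anyone following along, my rough understanding of the data-table route (the table, file, and column names here are invented): register the known reference packages in tool_data_table_conf.xml,

    <table name="refpkgs" comment_char="#">
        <columns>value, name, path</columns>
        <file path="tool-data/refpkgs.loc" />
    </table>

back it with a tab-separated tool-data/refpkgs.loc listing one package per line, and then have each tool offer the packages through a select parameter:

    <param name="refpkg" type="select" label="Reference package">
        <options from_data_table="refpkgs" />
    </param>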
participants (3)
- Aaron Gallagher
- Daniel Blankenberg
- Langhorst, Brad