EST download from any source?
Hi,

I'm trying to wrap my own tool in Galaxy. The input to my tool includes a set of ESTs (such as the entire human collection). I tried using the UCSC Genome Browser, but it doesn't seem to let me download the whole human collection because of the size of the data.

I then implemented my own FTP client and tried to wrap that in Galaxy. I intend to have the FTP client download data directly from NCBI's FTP server and return the downloaded files as output files that feed back into Galaxy. I intend to make the FTP client somewhat generic, so that it does not enforce a file type, though in my case I would be downloading gzipped GenBank files.

But Galaxy's support for multiple output files tripped me up. I do not know exactly what to do, since it looks as if Galaxy requires a strict naming convention for the outputs, according to http://gmod.827538.n3.nabble.com/Multiple-output-not-known-until-tool-run-td... (in my case, the number of files is obviously not known until run time). I guess it doesn't really matter: if I send those files, whatever the naming convention is, to a gzip decompressor (which I am planning to wrap in a simple way, just enough to handle my data), then it should all work out fine.

Alternatively, I can just ask users to download the files from the NCBI FTP site themselves, decompress them, and upload them to Galaxy. What's the best approach here?

I also noticed that the file types do not include GenBank or gzip types. Is there some generic type I could use? Just the Data class?

Timothy
On 09/14/2011 10:39 AM, Timothy Wu wrote:
Alternatively, I can just ask users to download the files from the NCBI FTP site themselves, decompress them, and upload them to Galaxy.
What's the best approach here?
How about: you download the data once, and then offer it as a 'data library' to your users. This way you avoid data duplication.
And I noticed that the file types do not include GenBank or gzip types. Is there some generic type I could use? Just the Data class?
We treat GenBank files as "txt". This works fine with the EMBOSS tools. Regards, Hans
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
On Wed, Sep 14, 2011 at 5:21 PM, Hans-Rudolf Hotz <hrh@fmi.ch> wrote:
How about: you download the data once, and then offer it as a 'data library' to your users. This way you avoid data duplication.
I do not know how to prepare a "data library". However, I think this is less than optimal, as the data itself may be updated. And I don't think data duplication is really a problem if the users install their own copy of Galaxy.

I think I need some kind of "data source" implementation that allows users to obtain the data themselves. However, with the current tool XML definition, I don't know how to have an FTP download tool fetch EST data from NCBI into Galaxy directly.

Oh well, if all else fails I guess I'll resort to having users upload the zipped EST GenBank files to Galaxy themselves via FTP. Or I'll have the FTP tool also parse the downloaded GenBank files and merge all the data into a single file. But this really limits the flexibility of the FTP tool, which could be more generic.

Timothy
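Editorial note: the "merge everything into a single file" fallback mentioned above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the thread: the function name `merge_gzipped` is hypothetical, and it assumes that plain concatenation of the decompressed files is acceptable to whatever reads the merged file, which depends on the downstream parser.

```python
import gzip
import shutil
from pathlib import Path


def merge_gzipped(gz_paths, merged_path):
    """Decompress each .gz file in order and append it to one output file.

    Hypothetical helper: it yields a single, fixed-name output, which
    sidesteps the question of how many output files a tool produces.
    """
    with open(merged_path, "wb") as merged:
        for gz_path in gz_paths:
            with gzip.open(gz_path, "rb") as part:
                # Stream in chunks so large files are never held in memory.
                shutil.copyfileobj(part, merged)
    return Path(merged_path)
```

The trade-off is the one Timothy names: a single merged output is simple to declare in a tool definition, but the tool is then tied to file types that can be concatenated.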
On Thu, Sep 15, 2011 at 9:32 AM, Timothy Wu <2huggie@gmail.com> wrote:
I think I need some kind of "data source" implementation that allows users to obtain the data themselves. However, with the current tool XML definition, I don't know how to have an FTP download tool fetch EST data from NCBI into Galaxy directly.
Perhaps I have misunderstood you, but I'd just use the provided "Upload Data" tool, and paste in the FTP URL for the file, e.g. an NCBI FTP URL. Peter
On Thu, Sep 15, 2011 at 4:58 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Perhaps I have misunderstood you, but I'd just use the provided "Upload Data" tool, and paste in the FTP URL for the file, e.g. an NCBI FTP URL.
I wasn't aware that the Upload Data tool could take an FTP URL, so thanks for letting me know. Unfortunately, it doesn't take a wildcard. At a minimum, I need a path specification like "ftp://ftp.ncbi.nih.gov/genbank/gbest*.seq.gz".

Actually, my tool is more versatile (though I don't need that for this particular application). I could specify ftp://ftp.ncbi.nih.gov/genomes/*/*/NC_*.fna and grab the FASTA files for all chromosomes of all species under the genomes directory. I thought it would be a nice tool to have in my Galaxy arsenal.

Timothy
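Editorial note: the wildcard expansion described above could look roughly like the following Python sketch. It is not Timothy's tool, whose code is not shown in the thread. The names `expand_pattern`, `ftp_lister`, and `list_remote` are hypothetical, and the sketch assumes the server answers NLST requests for each directory it has to scan. Only the standard-library `ftplib`, `fnmatch`, and `urllib.parse` modules are used.

```python
import fnmatch
import ftplib
from urllib.parse import urlparse


def expand_pattern(list_dir, pattern):
    """Expand a path pattern such as '/genomes/*/*/NC_*.fna'.

    list_dir(path) must return the entry names found in that directory.
    The pattern is matched one path segment at a time, so only the
    directories that can still lead to a match are listed.
    """
    matches = [""]
    for segment in pattern.strip("/").split("/"):
        next_matches = []
        for parent in matches:
            if any(ch in segment for ch in "*?["):
                try:
                    names = list_dir(parent or "/")
                except ftplib.all_errors:
                    continue  # parent is not a listable directory
                hits = [n for n in names if fnmatch.fnmatchcase(n, segment)]
            else:
                hits = [segment]  # literal segment: no listing needed
            next_matches.extend(f"{parent}/{name}" for name in sorted(hits))
        matches = next_matches
    return matches


def ftp_lister(ftp):
    """Build a list_dir callable backed by an open ftplib.FTP connection."""
    def list_dir(path):
        # NLST may return bare names or full paths depending on the server.
        return [entry.rsplit("/", 1)[-1] for entry in ftp.nlst(path)]
    return list_dir


def list_remote(url):
    """Connect anonymously and return the remote paths matching the URL."""
    parts = urlparse(url)
    with ftplib.FTP(parts.hostname) as ftp:
        ftp.login()  # anonymous login
        return expand_pattern(ftp_lister(ftp), parts.path)
```

Under these assumptions, `list_remote("ftp://ftp.ncbi.nih.gov/genbank/gbest*.seq.gz")` would return the matching remote paths, which a wrapper could then download one by one. Keeping the listing function separate from the matching logic lets the expansion be tested without a network connection.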
On Thu, Sep 15, 2011 at 10:32 AM, Timothy Wu <2huggie@gmail.com> wrote:
I wasn't aware that the Upload Data tool could take an FTP URL, so thanks for letting me know.
Unfortunately, it doesn't take a wildcard.
At a minimum, I need a path specification like "ftp://ftp.ncbi.nih.gov/genbank/gbest*.seq.gz".
Actually, my tool is more versatile (though I don't need that for this particular application).
I could specify
ftp://ftp.ncbi.nih.gov/genomes/*/*/NC_*.fna
and grab the FASTA files for all chromosomes of all species under the genomes directory. I thought it would be a nice tool to have in my Galaxy arsenal.
Timothy
That volume of data shouldn't really be uploaded into individual Galaxy users' histories (unless you have a Galaxy setup with an unusually high disk quota per user, lucky you).

This seems ideal for the Galaxy data library functionality, where the Galaxy admin loads the big data sets and makes them available to all the Galaxy users (or to a subset, using access controls). In the user's history the files are just linked to, so there is only one copy on disk.

http://wiki.g2.bx.psu.edu/Admin/Data%20Libraries/Libraries

However, we'd also like easy access to (some of) the files on ftp://ftp.ncbi.nih.gov/genomes/, so a new "NCBI Genomes FTP-site Data Source Tool" as part of Galaxy would be nice (like the existing UCSC data source etc).

Peter
On Thu, Sep 15, 2011 at 5:32 PM, Timothy Wu <2huggie@gmail.com> wrote:
I wasn't aware that the Upload Data tool could take an FTP URL, so thanks for letting me know.
Unfortunately, it doesn't take a wildcard.
At a minimum, I need a path specification like "ftp://ftp.ncbi.nih.gov/genbank/gbest*.seq.gz".
Actually, my tool is more versatile (though I don't need that for this particular application).
I could specify
ftp://ftp.ncbi.nih.gov/genomes/*/*/NC_*.fna
and grab the FASTA files for all chromosomes of all species under the genomes directory. I thought it would be a nice tool to have in my Galaxy arsenal.
It looks like a good idea to check how the upload tool is implemented, but it seems a bit complex. I don't understand why it does not have an <outputs> tag. It also has this action tag, <action module="galaxy.tools.actions.upload" class="UploadToolAction"/>, which is not explained in the "Tool Config Syntax" documentation.

Is there any documentation or tutorial that would help me understand how to implement this?

Timothy
Timothy Wu wrote:
It looks like a good idea to check how the upload tool is implemented, but it seems a bit complex. I don't understand why it does not have an <outputs> tag. It also has this action tag, <action module="galaxy.tools.actions.upload" class="UploadToolAction"/>, which is not explained in the "Tool Config Syntax" documentation.
Is there any documentation or tutorial that would help me understand how to implement this?
Timothy
Hi Timothy,

The default action taken when executing a tool is to call the execute() method in lib/galaxy/tools/actions/__init__.py in DefaultToolAction. This method prepares the tool and creates a job to run the tool. The upload tool is unlike other tools and can't use this default method, so it instead uses an action in upload.py in the same directory.

This would probably be easiest as a new tool, however, since it's a very specialized case.

--nate
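Editorial note: the general idea of a tool config naming its own action class through module and class attributes can be illustrated with a short, self-contained Python sketch. This is not Galaxy's code. `DefaultAction` and `resolve_action` are hypothetical stand-ins, and the sketch only assumes that the `<action>` element carries `module` and `class` attributes, as in the tag Timothy quotes.

```python
import importlib
import xml.etree.ElementTree as ET


class DefaultAction:
    """Hypothetical stand-in for a default tool action."""

    def execute(self, tool_id):
        return f"queued job for {tool_id}"


def resolve_action(tool_xml):
    """Return an action instance for a tool config string.

    If the config contains <action module="..." class="..."/>, import that
    module and instantiate the named class; otherwise use DefaultAction.
    """
    root = ET.fromstring(tool_xml)
    action = root.find("action")
    if action is None:
        return DefaultAction()
    module = importlib.import_module(action.get("module"))
    return getattr(module, action.get("class"))()
```

In this sketch, a config without an `<action>` element gets the default behaviour, while one that names a module and class gets whatever that class implements. That would be the hook for a specialized tool such as an uploader.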
participants (4): Hans-Rudolf Hotz, Nate Coraor, Peter Cock, Timothy Wu