Setting up microbial_data.loc given a mirror of NCBI FTP site
Hi all, I'd like to be able to use the "Get Microbial Data" tool in our local Galaxy install, which appears as though it could allow access to a local copy of the NCBI "Bacteria" FTP site, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
From looking at the tool's source code, I see I must populate microbial_data.loc file, however the microbial_data.loc.sample is not very helpful:
#This is a sample file distributed with Galaxy that enables tools
#to retrieve microbial data via a URL
#
#...

What this doesn't tell me is the meaning of the columns. Apparently this is really three tables in one, determined by the first entry.

ORG entries are used by this tool for the selection of the kingdom and species. There is one entry per species, and they appear to have the following columns:

0. The "ORG" column itself, not counted in the XML offsets
1. Identifier
2. Species
3. Kingdom
4. Group
5. Comma separated list of chromosomes/plasmids
6. URL for NCBI genome project

The CHR entries don't seem to be used directly by this tool. There is one entry per chromosome/plasmid.

0. The "CHR" entry, not counted in the XML offsets
1. Identifier
2. Description including species and chromosome/plasmid
4. Length of sequence (nucleotides)
5. GI number
6. None
7. URL for NCBI nucleotide database

Then there are the DATA entries, which appear to reference local files. There are multiple DATA entries per CHR entry:

0. The "DATA" entry, not counted in the XML offsets
1. Identifier (composite of ORG id, CHR id, and data type)
2. Identifier of ORG line
3. Identifier of CHR line
4. Data type (CDS, tRNA, rRNA, sequence, GeneMark, Glimmer3)
5. File format (fasta or bed)
6. Filename

What I want to do is generate a microbial_data.loc file from a local mirror of ftp://ftp.ncbi.nih.gov/genomes/Bacteria/

In addition to understanding the loc file format, it also seems I need to generate some bed files from the NCBI provided data. For example, for NC_008265, which is one of the examples in the sample loc file, I'd need the following files:

NC_008265.CDS.bed
NC_008265.tRNA.bed
NC_008265.rRNA.bed
NC_008265.fna
NC_008265.GeneMark.bed
NC_008265.GeneMarkHMM.bed
NC_008265.Glimmer3.bed

Referring to the NCBI FTP site for this organism, we have:

NC_008265.GeneMark-2.5m
NC_008265.GeneMarkHMM-2.6r
NC_008265.Glimmer3
NC_008265.Prodigal-2.50
NC_008265.asn
NC_008265.faa
NC_008265.ffn
NC_008265.fna
NC_008265.frn
NC_008265.gbk
NC_008265.gff
NC_008265.ptt
NC_008265.rnt
NC_008265.rpt
NC_008265.val

See ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Clostridium_perfringens_SM101_uid58117/

I can see, for example, how to map *.ptt (protein tables) into *.CDS.bed, and similarly for the Glimmer3 and GeneMark predictions. I could also probably parse *.gbk to generate bed tabular files for any annotated tRNA and rRNA entries (and the CDS entries, of course).

But rather than reinventing the wheel, how do you do this at Penn State?

Also, I'd like to offer access to the chromosome, CDS, tRNA, and rRNA sequences themselves (as FASTA files, not just bed tabular). Am I right that currently the "Get Microbial Data" tool doesn't offer this?

Thanks,

Peter
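For illustration, here is a minimal sketch of how the three record types described in the message above might be read. It is not the code Galaxy uses. It assumes the loc file is tab-separated (usual for Galaxy .loc files, but not confirmed in this thread), and the names parse_loc, Org and Data, along with their field names, are made up for this example. CHR records are kept as raw field lists because the column numbering given above skips 3.

```python
from collections import namedtuple

# Field names follow the column description in the message above; they are
# illustrative labels only, not names used by Galaxy itself.
Org = namedtuple("Org", "identifier species kingdom group chromosomes url")
Data = namedtuple("Data", "identifier org_id chr_id data_type file_format filename")


def parse_loc(lines, sep="\t"):
    """Split loc lines into ORG, CHR and DATA records.

    Assumption: fields are separated by tabs. Comment lines start with '#'.
    """
    orgs, chrs, data = [], [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(sep)
        kind, rest = fields[0], fields[1:]
        if kind == "ORG" and len(rest) >= 6:
            org = Org(*rest[:6])
            # Column 5 is described as a comma separated list of chromosomes/plasmids
            orgs.append(org._replace(chromosomes=org.chromosomes.split(",")))
        elif kind == "CHR":
            chrs.append(rest)  # kept raw: the description above is incomplete
        elif kind == "DATA" and len(rest) >= 6:
            data.append(Data(*rest[:6]))
    return orgs, chrs, data
```

Calling parse_loc(open("microbial_data.loc")) would return three lists, which makes it easy to check that every DATA line points at an existing ORG and CHR identifier before handing the file to the tool.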
Peter,

I guess you have seen the scripts that fetch the microbial data over FTP, in galaxy_central/scripts/microbes? I have been struggling to get them to work (remotely, not even locally yet), but without success. It was taking me too much time to really dig into it, so I hope you will succeed!

For what it's worth...

Alex

-----Original message-----
From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On behalf of Peter
Sent: Monday, 20 December 2010 4:11
To: galaxy-dev@lists.bx.psu.edu
Subject: [galaxy-dev] Setting up microbial_data.loc given a mirror of NCBI FTP site
_______________________________________________
galaxy-dev mailing list
galaxy-dev@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-dev
Hi Peter,

As Alex has pointed out, the scripts that were used to create this data are available under scripts/microbes/ and there is a README.txt file available there as well.

However, it has been some time since these scripts have been used and they have become stale. They would require some real amount of tweaking to get working properly again (there was some messy webpage scraping against the NCBI microbial genomes project page involved), but we don't have the resources or plan to do this now.

At this point we are moving towards removing this tool from our main server (it has already been removed from tool_conf.xml.main), but would be more than willing to reincorporate a working version of the retrieval and parsing scripts. However, this tool predates Library functionality, which is much better suited for providing access to static precached datasets. I can take a look into assembling a compressed file which contains the data and location file currently used by the main server if you are interested, but it should be noted that the tool itself has developed some quirks introduced by some unknown changesets that prevent it from working entirely properly (e.g. selecting multiple datasets at once).

Thanks for using Galaxy,

Dan

On Dec 20, 2010, at 2:15 PM, Bossers, Alex wrote:
> Peter, I guess you have seen the get ftp data scripts for microbial data in galaxy_central/scripts/microbes? I have been struggling with getting this to work (even remote and not local yet) but it failed. Took me too much time to really dig into getting it to work. So I hope you will succeed!
> For what it's worth....
> Alex
On Mon, Dec 20, 2010 at 9:42 PM, Daniel Blankenberg <dan@bx.psu.edu> wrote:
> Hi Peter,
>
> As Alex has pointed out, the scripts that were used to create this data are available under scripts/microbes/ and there is a README.txt file available there as well.

Thank you both for pointing that out - I hadn't found it yet. Adding a mention of this to the microbial_data.loc.sample file would have helped ;)

> However, it has been some time since these scripts have been used and they have become stale. They would require some real amount of tweaking to get working properly again (there was some messy webpage scraping against the NCBI microbial genomes project page involved), but we don't have the resources or plan to do this now.

Oh. That's a shame - but a little tweaking I'm willing to attempt.

> At this point we are moving towards removing this tool from our main server (it has already been removed from tool_conf.xml.main), but would be more than willing to reincorporate a working version of the retrieval and parsing scripts. However, this tool predates Library functionality, which is much better suited for providing access to static precached datasets. I can take a look into assembling a compressed file which contains the data and location file currently used by the main server if you are interested, but it should be noted that the tool itself has developed some quirks introduced by some unknown changesets that prevent it from working entirely properly (e.g. selecting multiple datasets at once).

I don't know much about the library functionality yet (pointers to docs welcome), but this could be useful to us so I'll try to make time to look at it.

A copy of the current live microbial_data.loc file would be very helpful, along with a set of data files for one or maybe two organisms (e.g. NC_000913, E. coli K12, and NC_005213, Nanoarchaeum equitans, small but an interesting test case as it has a gene spanning the origin).

Thanks,

Peter
On Mon, Dec 20, 2010 at 9:52 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
> Oh. That's a shame - but a little tweaking I'm willing to attempt.
I immediately identified a small tweak required: the extension for the GeneMark files has changed. However, as you suspected, there was a problem before that issue would ever crop up. Looking at the scripts/microbes/harvest_bacteria.py file, I see that it is parsing the HTML from this very handy NCBI page:

http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

And it does look like the HTML parser needs a little tweaking :(

I remember that page used to have an option for plain text tabular output. I've just tried a few likely arguments in the URL but didn't find it. Another option is to use the FTP site, in particular:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/lproks_0.txt (all)
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/lproks_1.txt (complete)
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/lproks_2.txt (in progress)
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/summary.txt (chromosomes and plasmids)

The only catch is these files don't seem to include the FTP folder names (which have changed recently to include the uid at the end), which must be inferred from the name (mapping dodgy characters to underscores).

I've made a branch here, so far just a few commits to update the harvest_bacteria.py script and clarify the tool output in the XML file, which I hope you'll consider transplanting or reworking for inclusion on the trunk later [I'm still testing things for now]:

https://bitbucket.org/peterjc/galaxy-central/src/microbes

Regards,

Peter
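The folder name inference mentioned above could be sketched roughly as follows. This is only a guess at the rule, not what harvest_bacteria.py or the NCBI actually does: it assumes every character other than an ASCII letter or digit becomes one underscore, and that "_uid" plus the project uid is appended. The function name guess_ftp_folder is hypothetical. The one example checked here is the folder quoted earlier in this thread; any other organism should be verified against a real FTP directory listing.

```python
import re


def guess_ftp_folder(organism_name, uid):
    """Guess the NCBI FTP folder name for an organism.

    Assumed rule (not confirmed by the NCBI): each character that is not
    an ASCII letter or digit is mapped to a single underscore, and the
    project uid is appended as a '_uid<number>' suffix.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]", "_", organism_name)
    return "%s_uid%s" % (cleaned, uid)


# Matches the folder name quoted earlier in the thread
print(guess_ftp_folder("Clostridium perfringens SM101", 58117))
# Clostridium_perfringens_SM101_uid58117
```

Because the rule is inferred rather than documented, a safer approach in practice may be to list the FTP directory once and match guessed names against it, reporting any organism that has no match.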
On Tue, Dec 21, 2010 at 1:44 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
> I immediately identified a small tweak required, ... I've made a branch here, so far just a few commits ... [I'm still testing things for now]:
> https://bitbucket.org/peterjc/galaxy-central/src/microbes
Hi Dan,

I have another question - why does harvest_bacteria.py etc. use project IDs as the folder names (numbers) rather than using the same names as the NCBI (species names with underscores, plus in recent months a suffix of the uid)? If you had opted to match the NCBI tree, then it would be easy to fetch all the GenBank files, all the GeneMark files etc. using the provided tarballs:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.gbk.tar.gz
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.GeneMark.tar.gz
etc.

I've started reading about Data Libraries on the wiki:

https://bitbucket.org/galaxy/galaxy-central/wiki/DataLibraries/Libraries
https://bitbucket.org/galaxy/galaxy-central/wiki/DataLibraries/UploadingFile...
https://bitbucket.org/galaxy/galaxy-central/wiki/DataLibraries/Tutorial/Data...

Are there any nice examples of tools/scripts which populate Galaxy Data Libraries automatically which you think it would be helpful to read?

Peter
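As a rough illustration of why matching the NCBI tree would make the tarballs mentioned above convenient, the sketch below groups the files inside a gzipped tarball by their folder, using only the Python standard library. It assumes the archive members are laid out as folder/file (for example one folder per organism); that layout has not been checked against the real all.gbk.tar.gz, and the function name members_by_folder is made up for this example.

```python
import posixpath
import tarfile
from collections import defaultdict


def members_by_folder(fileobj):
    """Map each folder in a gzipped tarball to the file names it contains.

    Assumption: members are stored as 'folder/file'; files at the top
    level end up under the empty string key.
    """
    groups = defaultdict(list)
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if member.isfile():
                folder, name = posixpath.split(member.name)
                groups[folder].append(name)
    return dict(groups)
```

With a downloaded copy, members_by_folder(open("all.gbk.tar.gz", "rb")) would show which organism folders are present without unpacking anything, which could then be compared against the folder names expected by the loc file.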
Peter;
> I've started reading about Data Libraries on the wiki:
> https://bitbucket.org/galaxy/galaxy-central/wiki/DataLibraries/Libraries
> https://bitbucket.org/galaxy/galaxy-central/wiki/DataLibraries/UploadingFile...
> https://bitbucket.org/galaxy/galaxy-central/wiki/DataLibraries/Tutorial/Data...
>
> Are there any nice examples of tools/scripts which populate Galaxy Data Libraries automatically which you think it would be helpful to read?
You can use the API for this. Here's a script that builds data libraries for next gen sequencing runs:

https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/upload_to_galax...

It selects files of interest, organizes them into a local directory structure, and then uploads them to Galaxy. Folders are created via the API, and this all uses a thin wrapper:

https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/galaxy/api.py

Brad
On Wed, Dec 22, 2010 at 3:49 PM, Brad Chapman <chapmanb@50mail.com> wrote:
> You can use the API for this. Here's a script that builds data libraries for next gen sequencing runs:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/upload_to_galax...
>
> It selects files of interest, organizes them into a local directory structure, and then uploads them to Galaxy. Folders are created via the API, and this all uses a thin wrapper:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/galaxy/api.py
>
> Brad
That looks very handy Brad - thank you :)

What I'm not clear on yet is how to structure the libraries - in particular can I associate a genome with a library, or with each file in a library?

If I go with one library per bacteria/archaea, how well would Galaxy cope with 800+ libraries?

If I go with one *big* library for all the NCBI RefSeq bacteria/archaea, using a folder structure inside the library, how easy will it be for the user to find a particular genome?

[We'd probably want to extend this to other NCBI RefSeq genomes later, e.g. plants, fungi and some animals]

I guess I'll have to experiment, but I imagine Dan has thought about this already and may have some advice.

Cheers,

Peter
Hi Peter,

I managed to get the microbial data compressed and it is available, along with the .loc file, here: http://www.bx.psu.edu/~dan/microbes/test_only/all/

The mention of using Libraries instead of a tool was primarily due to the static content of what is being made available by the tool. Currently the tool will create a new file on disk each time a user requests a dataset; with libraries, each user would have a pointer to the same file on disk. I had been thinking of arranging the libraries similar to how they are in the tool, but perhaps with additional sub-categories; although as the number of genomes (greatly) increases I'm not entirely certain how the interface would scale. Maybe some UI enhancements to the libraries could make it more manageable. Libraries also have the added benefit of allowing some versioning of datasets.

There does seem to be some interest in this tool/data, so after the new year I will try to find some time to take another pass through it. It appears that you've looked through it quite a bit, so I'll definitely be using your recent efforts/notes to help with this (if you haven't gotten it all fixed and working by then ;) ).

Thanks,

Dan

On Dec 22, 2010, at 11:02 AM, Peter wrote:
> I guess I'll have to experiment, but I imagine Dan has thought about this already and may have some advice.
On Thu, Dec 23, 2010 at 3:47 PM, Daniel Blankenberg <dan@bx.psu.edu> wrote:
> Hi Peter,
>
> I managed to get the microbial data compressed and it is available, along with the .loc file, here: http://www.bx.psu.edu/~dan/microbes/test_only/all/

Great - can you keep that online till mid Jan? I'll try to download it in early Jan... if you could extract a species or two on their own it would be less of a download (and I'd be tempted to take a peek from home).

> The mention of using Libraries instead of a tool was primarily due to the static content of what is being made available by the tool. Currently the tool will create a new file on disk each time a user requests a dataset; with libraries, each user would have a pointer to the same file on disk. I had been thinking of arranging the libraries similar to how they are in the tool, but perhaps with additional sub-categories; although as the number of genomes (greatly) increases I'm not entirely certain how the interface would scale. Maybe some UI enhancements to the libraries could make it more manageable. Libraries also have the added benefit of allowing some versioning of datasets.
>
> There does seem to be some interest in this tool/data, so after the new year I will try to find some time to take another pass through it. It appears that you've looked through it quite a bit, so I'll definitely be using your recent efforts/notes to help with this (if you haven't gotten it all fixed and working by then ;) ).

Well, I'm afraid I won't be doing any more on this until the new year, but then I'll be happy to help test/guide/give feedback on any updates.

Thank you, and Happy Christmas & New Year,

Peter
participants (4): Bossers, Alex; Brad Chapman; Daniel Blankenberg; Peter