Re: [galaxy-dev] Setting up microbial_data.loc given a mirror of NCBI FTP site

20 Dec 2010

Hi Peter,

As Alex has pointed out, the scripts that were used to create this data are available under scripts/microbes/, and there is a README.txt file there as well. However, it has been some time since these scripts were last used and they have become stale. They would require a fair amount of tweaking to get working properly again (there was some messy webpage scraping against the NCBI microbial genomes project page involved), and we have neither the resources nor plans to do this now.

At this point we are moving towards removing this tool from our main server (it has already been removed from tool_conf.xml.main), but we would be more than willing to reincorporate a working version of the retrieval and parsing scripts. However, this tool predates Library functionality, which is much better suited for providing access to static precached datasets. I can look into assembling a compressed file containing the data and location file currently used by the main server if you are interested, but it should be noted that the tool itself has developed some quirks, introduced by some unknown changesets, that prevent it from working entirely properly (e.g. selecting multiple datasets at once).

Thanks for using Galaxy,

Dan

On Dec 20, 2010, at 2:15 PM, Bossers, Alex wrote:
...
Peter,
I guess you have seen the get ftp data scripts for microbial data in galaxy_central/scripts/microbes? I have been struggling with getting this to work (even remote and not local yet) but it failed. Took me too much time to really dig into getting it to work.
So I hope you will succeed!
For what it's worth....
Alex
-----Original message-----
From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On behalf of Peter
Sent: Monday 20 December 2010 4:11
To: galaxy-dev@lists.bx.psu.edu
Subject: [galaxy-dev] Setting up microbial_data.loc given a mirror of NCBI FTP site
Hi all,
I'd like to be able to use the "Get Microbial Data" tool in our local Galaxy install, which appears as though it could allow access to a local copy of the NCBI "Bacteria" FTP site, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
...
From looking at the tool's source code, I see I must populate the microbial_data.loc file; however, the microbial_data.loc.sample is not very helpful:
#This is a sample file distributed with Galaxy that enables tools
#to retrieve microbial data via a URL
#
#...
What this doesn't tell me is the meaning of the columns. Apparently this is really three tables in one, determined by the first entry.
ORG entries are used by this tool for the selection of the kingdom and species. They appear to have the following columns, one per species:
0. The "ORG" column itself, not counted in the XML offsets
1. Identifier
2. Species
3. Kingdom
4. Group
5. Comma separated list of chromosomes/plasmids
6. URL for NCBI genome project
The CHR entries don't seem to be used directly by this tool.
There is one entry per chromosome/plasmid.
0. The "CHR" entry, not counted in the XML offsets
1. Identifier
2. Description including species and chromosome/plasmid
4. Length of sequence (nucleotides)
5. GI number
6. None
7. URL for NCBI nucleotide database
Then there are the DATA entries, which appear to reference local files. There are multiple DATA entries per CHR entry:
0. The "DATA" entry, not counted in the XML offsets
1. Identifier (composite of ORG id, CHR id, and data type)
2. Identifier of ORG line
3. Identifier of CHR line
4. Data type (CDS, tRNA, rRNA, sequence, GeneMark, Glimmer3)
5. File format (fasta or bed)
6. Filename
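
To make the "three tables in one" reading concrete, here is a minimal sketch that splits such a file into its ORG, CHR and DATA rows. It assumes the fields are tab-separated and that lines starting with "#" are comments; the function name parse_microbial_loc and the sample rows are hypothetical, invented for illustration rather than taken from Galaxy or from the real sample file.

```python
from collections import defaultdict


def parse_microbial_loc(lines):
    """Group loc lines by their first column (ORG, CHR or DATA).

    Assumption: fields are tab-separated and '#' starts a comment line.
    Each row keeps column 0, so the indices match the numbering above.
    """
    tables = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        tables[fields[0]].append(fields)
    return tables


# Invented rows that follow the column layout inferred above.
sample = [
    "#This is a comment\n",
    "ORG\t12345\tExample species\tBacteria\tExample group\tNC_000000\thttp://example.invalid/org\n",
    "DATA\t12345_NC_000000_CDS\t12345\tNC_000000\tCDS\tbed\t/data/NC_000000.CDS.bed\n",
]
tables = parse_microbial_loc(sample)
for row in tables["DATA"]:
    print(row[4], row[5], row[6])  # data type, file format, filename
```

Keeping column 0 in each row means row[1] is always the identifier, which matches the numbering used in the lists above.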
What I want to do is generate a microbial_data.loc file from a local mirror of ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
In addition to understanding the loc file format, it also seems I need to generate some bed files from the NCBI provided data, e.g. for NC_008265 which is one of the examples in the sample loc files, I'd need the following files:
NC_008265.CDS.bed
NC_008265.tRNA.bed
NC_008265.rRNA.bed
NC_008265.fna
NC_008265.GeneMark.bed
NC_008265.GeneMarkHMM.bed
NC_008265.Glimmer3.bed
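
As a rough sketch of how DATA lines for these files might be written out, following the DATA column layout inferred earlier in this message: the tab separator, the underscore-joined composite identifier, the data type names for .fna and GeneMarkHMM, the ORG identifier "12345" and the directory "/mirror/data" are all assumptions made for illustration, not values confirmed by the Galaxy sources.

```python
import os

# (data type, file format, file suffix). The suffixes come from the list of
# files above; the pairing with data type names is an assumption.
EXPECTED_FILES = [
    ("CDS", "bed", ".CDS.bed"),
    ("tRNA", "bed", ".tRNA.bed"),
    ("rRNA", "bed", ".rRNA.bed"),
    ("sequence", "fasta", ".fna"),
    ("GeneMark", "bed", ".GeneMark.bed"),
    ("GeneMarkHMM", "bed", ".GeneMarkHMM.bed"),
    ("Glimmer3", "bed", ".Glimmer3.bed"),
]


def data_lines(org_id, chr_id, directory):
    """Return one tab-separated DATA line per expected file (hypothetical helper)."""
    lines = []
    for data_type, file_format, suffix in EXPECTED_FILES:
        path = os.path.join(directory, chr_id + suffix)
        # Assumed composite identifier: ORG id, CHR id and data type joined by "_".
        identifier = "_".join([org_id, chr_id, data_type])
        lines.append("\t".join(
            ["DATA", identifier, org_id, chr_id, data_type, file_format, path]))
    return lines


# "12345" is a placeholder ORG identifier, not the real one for this organism.
loc_lines = data_lines("12345", "NC_008265", "/mirror/data")
for line in loc_lines:
    print(line)
```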
Referring to the NCBI FTP site for this organism, we have:
NC_008265.GeneMark-2.5m
NC_008265.GeneMarkHMM-2.6r
NC_008265.Glimmer3
NC_008265.Prodigal-2.50
NC_008265.asn
NC_008265.faa
NC_008265.ffn
NC_008265.fna
NC_008265.frn
NC_008265.gbk
NC_008265.gff
NC_008265.ptt
NC_008265.rnt
NC_008265.rpt
NC_008265.val
See ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Clostridium_perfringens_SM101_uid58117/
I can see, for example, how to map *.ptt (protein tables) into *.CDS.bed, and similarly for the Glimmer3 and GeneMark predictions. I could also probably parse *.gbk to generate bed tabular files for any annotated tRNA and rRNA entries (and the CDS entries, of course).
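
A minimal sketch of the *.ptt to *.CDS.bed mapping, not the method used by the Galaxy scripts. It assumes the usual NCBI protein table layout (a title line, a protein count line, a header line, then tab-separated rows whose first columns are Location as "start..end", Strand, Length, PID, Gene and Synonym), and it writes six-column BED with a 0-based start. The function name ptt_to_bed and the sample row are invented for illustration.

```python
def ptt_to_bed(ptt_lines, chrom):
    """Convert NCBI .ptt rows to 6-column BED lines (hypothetical helper).

    Assumed .ptt columns: Location, Strand, Length, PID, Gene, Synonym, ...
    """
    bed = []
    for line in ptt_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or ".." not in fields[0]:
            continue  # skip the title, protein count and column header lines
        try:
            start, end = (int(x) for x in fields[0].split(".."))
        except ValueError:
            continue  # location not in plain start..end form
        strand = fields[1]
        # Prefer the locus tag (Synonym); fall back to the PID if it is "-".
        name = fields[5] if fields[5] != "-" else fields[3]
        # .ptt coordinates are 1-based inclusive; BED is 0-based half-open.
        bed.append("\t".join([chrom, str(start - 1), str(end), name, "0", strand]))
    return bed


# Invented example input in the assumed layout.
ptt_sample = [
    "Example organism, complete genome - 1..5000\n",
    "1 proteins\n",
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n",
    "266..1639\t+\t457\t0000001\tgeneA\tLOCUS_0001\t-\t-\thypothetical protein\n",
]
bed_lines = ptt_to_bed(ptt_sample, "NC_000000")
print("\n".join(bed_lines))
```

The only real transformation is the coordinate shift: the start is decremented by one and the end is left unchanged.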
But rather than reinventing the wheel, how do you do this at Penn State?
Also, I'd like to offer access to the chromosome, CDS, tRNA, and rRNA sequences themselves (as FASTA files, not just bed tabular). Am I right that currently the "Get Microbial Data" tool doesn't offer this?
Thanks,
Peter
_______________________________________________
galaxy-dev mailing list
galaxy-dev@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-dev

Daniel Blankenberg