On Mon, Dec 20, 2010 at 9:52 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
On Mon, Dec 20, 2010 at 9:42 PM, Daniel Blankenberg <dan@bx.psu.edu> wrote:
Hi Peter,
As Alex has pointed out, the scripts that were used to create this data are available under scripts/microbes/ and there is a README.txt file available there as well.
Thank you both for pointing that out - I hadn't found it yet. Adding a mention of this to the microbial_data.loc.sample file would have helped ;)
However, it has been some time since these scripts have been used and they have become stale. They would require some real amount of tweaking to get working properly again (there was some messy webpage scraping against the NCBI microbial genomes project page involved), but we don't have the resources or plan to do this now.
Oh. That's a shame - but a little tweaking I'm willing to attempt.
I immediately identified one small tweak that is required: the extension used for the GeneMark files has changed. However, as you suspected, there was a problem before that issue would ever crop up.

Looking at the scripts/microbes/harvest_bacteria.py file, I see that it is parsing the HTML from this very handy NCBI page:

http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

And it does look like the HTML parser needs a little tweaking :(

I remember that page used to have an option for plain text tabular output. I've just tried a few likely arguments in the URL but didn't find it.

Another option is to use the FTP site, in particular:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/lproks_0.txt (all)
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/lproks_1.txt (complete)
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/lproks_2.txt (in progress)
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/summary.txt (chromosomes and plasmids)

The only catch is that these files don't seem to include the FTP folder names (which have changed recently to include the uid at the end), so the folder names must be inferred from the organism name (mapping dodgy characters to underscores).

I've made a branch here, so far with just a few commits to update the harvest_bacteria.py script and clarify the tool output in the XML file, which I hope you'll consider transplanting or reworking for inclusion on the trunk later [I'm still testing things for now]:

https://bitbucket.org/peterjc/galaxy-central/src/microbes

Regards,

Peter
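
For illustration, the folder-name inference described above (map dodgy characters in the organism name to underscores, then put the uid at the end) might look roughly like the sketch below. This is not the code from the branch. The function name infer_folder_name, the rule that every character other than an ASCII letter or digit counts as "dodgy", and the "_uid<number>" suffix format are all assumptions made for the example, and the organism name and uid in the usage line are made up.

```python
import re


def infer_folder_name(organism_name, uid):
    """Guess an FTP folder name from an organism name and a project uid.

    Hypothetical helper: the exact NCBI naming rule is assumed, not confirmed.
    """
    # Assumption: anything that is not an ASCII letter or digit is "dodgy"
    # and is replaced one-for-one with an underscore (runs are not collapsed).
    safe_name = re.sub(r"[^A-Za-z0-9]", "_", organism_name.strip())
    # Assumption: the uid is appended in the form "_uid<number>".
    return "%s_uid%s" % (safe_name, uid)


# Made-up organism name and uid, purely to show the mapping.
print(infer_folder_name("Example bacterium K-12 substr. X1", 12345))
```

Any guessed name would still need checking against the actual FTP directory listing, since the real rule may differ (for example in how brackets or repeated punctuation are handled).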