Re: [galaxy-dev] NCBI BLAST+ wrappers in Galaxy?

18 Nov 2010

This changeset undoes part of the change to data tables in my commit,
which was not an accident. The blastdb.loc.sample and  
blastdb_p.loc.sample now do not match the columns expected in the  
ncbi_blast_plus tools. Megablast uses the blastdb.loc file and it  
expects it to match the spec in tool_data_table_conf.xml.

Data tables are far more flexible than the raw loc approach, which is
why we changed it. The unique ID is necessary for the data tables  
approach and allows for the structure of the loc file and/or data  
location to change without breaking things. The former approach was to  
store the path as the value for the parameter. This means that if it  
was set in a workflow and that path changed (e.g. the data
directory was restructured), the workflow would no longer work.  
However, if we use the unique ID, it's possible to maintain backwards  
compatibility. Instead of the path, it stores the unique ID, which can  
be used to obtain the path so that it can be passed to the Python  
file. And for items that were already in the loc file, you set the  
unique ID to be the same as the original path, so that the parameter  
values in existing workflows are still the same. But new items can have
nicer-looking IDs. And if extra columns ever need to be added, it's  
easy. Ever since James' original data tables commit (in August), we  
have been wanting to change everything over to this style, so I am  
going to change these files back.
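
To illustrate the indirection described above, here is a minimal Python sketch. It is not Galaxy's actual implementation. It assumes a tab-separated loc file with three columns (unique ID, caption, path); the function name load_loc_table, the sample rows, and the paths are all hypothetical.

```python
import io


def load_loc_table(handle):
    """Map unique ID -> (caption, path) from a tab-separated loc file.

    Assumes three columns per row: unique ID, caption, path.
    Blank lines and lines starting with '#' are skipped.
    """
    table = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        unique_id, caption, path = line.split("\t")
        table[unique_id] = (caption, path)
    return table


# Hypothetical loc contents. The first row is a legacy entry whose ID is the
# path that old workflows stored; the data has since moved to a new location.
SAMPLE_LOC = (
    "# id\tcaption\tpath\n"
    "/old/data/nt\tNucleotide DB (legacy entry)\t/new/data/blast/nt\n"
    "nr_2010\tProtein DB\t/new/data/blast/nr\n"
)

table = load_loc_table(io.StringIO(SAMPLE_LOC))

# A workflow saved with the old path as its parameter value still resolves.
legacy_caption, legacy_path = table["/old/data/nt"]
# A new entry can use a nicer-looking ID.
new_caption, new_path = table["nr_2010"]
print(legacy_path)
print(new_path)
```

The stored parameter value stays stable while the third column is free to change, which is the point of keeping the ID separate from the path.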

If you don't want to reformat the loc files, just use the  
tool_data_table_conf.xml.oldlocstyle instead of  
tool_data_table_conf.xml.sample as the source for  
tool_data_table_conf.xml. This is where the columns are defined, and  
it's just a matter of defining name and value. (In case you're not  
seeing it, it's not showing up in one of my clones for some reason,  
but it is definitely in the repository.)

I'm working on a wiki page that will explain data tables, since  
they're pretty much undocumented at this point.

Kelly

On Nov 16, 2010, at 11:43 AM, Peter wrote:
...
On Fri, Nov 12, 2010 at 2:05 AM, Kanwei Li <kanwei@gmail.com> wrote:
...
All changesets in the please_merge branch have been merged.
Thanks for the contribution!
-Kanwei

Hi Kanwei & Kelly,
I've just updated my test installation of Galaxy and realised that
there is a problem with the loc file handling for BLAST+ due to
this commit from Kelly Vincent:
"Converted several tools to data table style of loc file handling
(Bowtie, BWA, Lastz, Megablast, PerM, SRMA). Cleaned up
several tool XML files, removing unnecessary None parameters."
http://bitbucket.org/galaxy/galaxy-central/changeset/535d276c92bc
When I wrote the BLAST+ wrappers, blastdb.loc (for nucleotides)
used two columns only (caption and path). Likewise for the
introduced blastdb_p.loc file.
The legacy megablast_wrapper.xml treated the first word of the
caption as an ID and passed it to megablast_wrapper.py which
used the loc file to look up the real path to use to call blastall.
This seems convoluted to me.
For my BLAST+ wrappers I just need the caption (to show to
the user) and the path (to use at the command line), which
were column indices 0 and 1 (python counting), thus:
<options from_file="blastdb.loc">
  <column name="name" index="0"/>
  <column name="value" index="1"/>
</options>
Then came this patch, from Kelly Vincent:
"Converted several tools to data table style of loc file handling
(Bowtie, BWA, Lastz, Megablast, PerM, SRMA). Cleaned up
several tool XML files, removing unnecessary None parameters."
http://bitbucket.org/galaxy/galaxy-central/changeset/535d276c92bc
After this patch, the blastdb.loc and blastdb_p.loc files have
three columns (id, caption, path), with the recommendation
that if you were using the old megablast_wrapper.xml then
pick the first word of the caption as the id (for backwards
compatibility).
The XML for the BLAST+ wrappers now (wrongly) uses this,
<options from_file="blastdb.loc">
  <column name="name" index="2"/>
  <column name="value" index="0"/>
</options>
That means the name shown to the users is column 2
(in python speak, i.e. the third column) which is the path (!)
and the value used to call the executable is column 0
(in python speak, i.e. the first column) which is the new
identifier column.
It is possible that this would run, but only if the identifier
was actually the name of a valid blast database (e.g. nr)
which was on the blast database path. Maybe that is
the case on Kelly's machine?
What it should be using is column indexes 1 and 2
(for the caption and path, ignoring the new id column):
<options from_file="blastdb.loc">
  <column name="name" index="1"/>
  <column name="value" index="2"/>
</options>
This is done in the following changeset:
http://bitbucket.org/peterjc/galaxy-central/changeset/6b499b39b804
Could one of you apply that please?
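
As a rough illustration of the two mappings above (a sketch, not Galaxy's code), the Python below applies each pair of column indices to one three-column row. The helper select_option and the sample row are made up for this example.

```python
def select_option(row, name_index, value_index):
    """Return (name, value) picked from a loc row by zero-based column index."""
    return row[name_index], row[value_index]


# Hypothetical three-column row: id, caption, path.
row = "nt\tnt 2010 nucleotide database\t/data/blastdb/nt".split("\t")

# Mapping currently in the BLAST+ wrappers: name=2, value=0.
wrong_name, wrong_value = select_option(row, 2, 0)
# Proposed mapping: name=1, value=2.
right_name, right_value = select_option(row, 1, 2)

print(wrong_name, "|", wrong_value)  # user sees the path, tool gets the id
print(right_name, "|", right_value)  # user sees the caption, tool gets the path
```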
I'd also like to know why the extra ID column was added - I
don't understand what it is for. Can we remove it again?
Regards,
Peter