Hi Peter,

As long as the user can choose, and the previous functionality remains the default (for backward compatibility), I'm all for this.

On Nov 9, 2010, at 11:34 AM, Peter wrote:
Hello all,
I ran into a problem in a workflow manipulating FASTA and tabular files. I traced this to an unexpected behaviour of the FASTA to tabular converter.
In my experience most command line tools which take FASTA files as input treat the first word after the ">" as the identifier for each FASTA record, and any subsequent text as an optional description. It could then make sense to turn a FASTA file into a three column tabular file (identifier, description, sequence). Currently Galaxy does not make this distinction, so we have just two columns (identifier+description, seq).
Would you all be amenable to my extending this script to allow the user to choose between 2 column output (current behaviour) and 3 column output (splitting the FASTA ">" line at the first white space)?
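To make the proposal concrete, here is a minimal sketch of the two output modes. It is not the existing Galaxy converter code; the function name `fasta_to_tabular` and the `split_description` flag are hypothetical names chosen for illustration, with the two column output kept as the default.

```python
def fasta_to_tabular(lines, split_description=False):
    """Yield tab-separated rows from an iterable of FASTA lines.

    Hypothetical sketch: with split_description=False each row is
    (title, sequence); with True it is (identifier, description, sequence).
    """
    title = None
    seq_parts = []

    def make_row(title, seq):
        if split_description:
            # Split the ">" line at the first run of white space only.
            parts = title.split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            return "\t".join([identifier, description, seq])
        return "\t".join([title, seq])

    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts, so emit the previous one (if any).
            if title is not None:
                yield make_row(title, "".join(seq_parts))
            title = line[1:].strip()
            seq_parts = []
        elif title is not None:
            # Sequence lines are concatenated into a single string.
            seq_parts.append(line.strip())
    if title is not None:
        yield make_row(title, "".join(seq_parts))


records = [">gi|123 some protein", "ACGT", "TTGA", ">id2", "GGCC"]
two_column = list(fasta_to_tabular(records))
three_column = list(fasta_to_tabular(records, split_description=True))
```

A record with no description gets an empty middle column in the three column mode, so every row keeps the same number of fields.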
Alternatively, I have written a less invasive patch to allow an easy way to extract the identifier (first word) and sequence:
http://bitbucket.org/peterjc/galaxy-central/changeset/f57552b4f9fb
Note that the converter currently allows the ">" line to be trimmed, which can achieve the same goal, but ONLY when all the identifiers are the same length (rarely the case in my experience).
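A small illustration of why fixed-width trimming falls short when identifier lengths differ. The helper names and the example titles are made up for this sketch and do not come from the converter.

```python
def trim_title(title, width):
    """Keep only the first `width` characters of the ">" line."""
    return title[:width]


def first_word(title):
    """Keep only the text before the first white space."""
    parts = title.split(None, 1)
    return parts[0] if parts else ""


# Made-up titles with identifiers of different lengths.
titles = ["seq1 first example", "sequence22 second example"]

trimmed = [trim_title(t, 4) for t in titles]
words = [first_word(t) for t in titles]
```

With a width of 4 the first identifier survives, but the second is cut short; splitting at the first white space recovers both.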
Similarly, I'd like to extend the tabular to FASTA converter to allow a third column to be selected as the description, giving for example ">c1 c3" as the ">" line, with the sequence coming from c2.
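The reverse direction could look roughly like the sketch below. Again this is only an assumption about how it might be done, not the current tool: `tabular_to_fasta` and its parameters are hypothetical, and the column indices here are zero-based (so c1, c2, c3 correspond to 0, 1, 2).

```python
def tabular_to_fasta(rows, id_col=0, seq_col=1, desc_col=None):
    """Yield FASTA records from tab-separated rows.

    Hypothetical sketch: desc_col=None reproduces the plain
    identifier-only ">" line; otherwise the chosen column is
    appended to the identifier after a single space.
    """
    for row in rows:
        fields = row.rstrip("\r\n").split("\t")
        title = fields[id_col]
        # Append the description only if the column exists and is non-empty.
        if desc_col is not None and desc_col < len(fields) and fields[desc_col]:
            title += " " + fields[desc_col]
        yield ">%s\n%s\n" % (title, fields[seq_col])


rows = ["id1\tACGT\tsome protein", "id2\tGGCC\t"]
with_description = list(tabular_to_fasta(rows, desc_col=2))
without_description = list(tabular_to_fasta(rows))
```

Skipping an empty description avoids a trailing space on the ">" line, so a file converted to three columns and back should keep its original titles, apart from any white space normalisation.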
I look forward to comments,
Thanks,
Peter
P.S. All these comments apply equally to the FASTQ to/from tabular converters.
-- jt James Taylor, Assistant Professor, Biology / Computer Science, Emory University