Hello all,
I ran into a problem in a workflow manipulating FASTA and tabular files, and traced it to unexpected behaviour in the FASTA to tabular converter.
In my experience most command line tools which take FASTA files as input treat the first word after the ">" as the identifier for each FASTA record, and any subsequent text as an optional description. It would therefore make sense to turn a FASTA file into a three column tabular file (identifier, description, sequence). Currently Galaxy does not make this distinction, so we have just two columns (identifier+description, sequence).
Would you all be amenable to my extending this script to allow the user to choose between 2 column output (current behaviour) and 3 column output (splitting the FASTA ">" line at the first whitespace)?
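To make the proposal concrete, here is a minimal sketch of the two output modes in Python. It is not the Galaxy converter itself; the function names (fasta_to_rows, _make_row) and the three_columns flag are made up for illustration, and the default stays at the current two column behaviour.

```python
def fasta_to_rows(lines, three_columns=False):
    """Yield one tuple per FASTA record from an iterable of lines.

    Hypothetical helper for illustration only, not the Galaxy script.
    """
    title, seq = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts, so emit the previous one (if any).
            if title is not None:
                yield _make_row(title, "".join(seq), three_columns)
            title, seq = line[1:], []
        elif title is not None:
            seq.append(line.strip())
    if title is not None:
        yield _make_row(title, "".join(seq), three_columns)


def _make_row(title, sequence, three_columns):
    if not three_columns:
        # Two columns: the whole ">" line, then the sequence.
        return (title, sequence)
    # Three columns: split the ">" line at the first run of whitespace.
    parts = title.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return (identifier, description, sequence)


example = [">seq1 hypothetical protein", "MKT", "AAL", ">seq2", "GGC"]
rows = list(fasta_to_rows(example, three_columns=True))
for row in rows:
    print("\t".join(row))
```

A record with no description simply gets an empty second column, so every row has the same number of fields.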
Alternatively, I have written a less invasive patch to allow an easy way to extract the identifier (first word) and sequence:
http://bitbucket.org/peterjc/galaxy-central/changeset/f57552b4f9fb
Note that the converter currently does allow the ">" line to be trimmed to a maximum number of characters, which can achieve the same goal but ONLY when all the identifiers are the same length (rarely the case in my experience).
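A small illustration of why fixed-length trimming falls short, assuming two made-up titles with identifiers of different lengths:

```python
# Hypothetical ">" lines (without the ">") whose identifiers differ in length.
titles = ["seq1 first protein", "sequence22 second protein"]

# Trimming to a fixed number of characters only suits the shorter identifier.
trimmed = [t[:4] for t in titles]

# Taking the first word works whatever the identifier length.
first_words = [t.split(None, 1)[0] for t in titles]

print(trimmed)
print(first_words)
```

Any fixed cut-off either truncates the longer identifier or pulls part of the description into the shorter one.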
Similarly, I'd like to extend the tabular to FASTA converter to allow a third column to be selected as the description, giving for example ">c1 c3" as the ">" line, with the sequence coming from c2.
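For the reverse direction, something along these lines would do. This is only a sketch: tabular_to_fasta and its parameters are hypothetical names, and the column indices are zero-based here (so c1, c2, c3 are 0, 1, 2).

```python
def tabular_to_fasta(lines, id_col=0, seq_col=1, desc_col=None):
    """Yield FASTA records from tab separated lines.

    Hypothetical helper for illustration; column indices are zero-based.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        title = fields[id_col]
        # Append the description only if a column was chosen and it is not empty.
        if desc_col is not None and fields[desc_col]:
            title += " " + fields[desc_col]
        yield ">%s\n%s\n" % (title, fields[seq_col])


table = ["seq1\tMKTAAL\thypothetical protein\n", "seq2\tGGC\t\n"]
records = list(tabular_to_fasta(table, desc_col=2))
print("".join(records), end="")
```

Leaving desc_col unset gives a ">" line built from the identifier column alone.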
I look forward to comments,
Thanks,
Peter
P.S. All these comments apply equally to the FASTQ to/from tabular converters.
Hi Peter, as long as the user can choose, and the previous functionality remains the default (for backward compatibility), I'm all for this.
On Nov 9, 2010, at 11:34 AM, Peter wrote:
[...]
-- jt
James Taylor, Assistant Professor, Biology / Computer Science, Emory University
On Tue, Nov 9, 2010 at 7:24 PM, James Taylor wrote:
Hi Peter, as long as the user can choose, and the previous functionality remains the default (for backward compatibility), I'm all for this.
Cool - but which of my proposals did you like (or both)? The option of three columns (id, description, sequence), or sticking with two columns but giving more flexibility for handling the id/description?
Peter
P.S. I'd have to tweak the changeset to make the old mode the default.
I think the former. Isn't that what you prefer as well? It seems like the more complete solution.
On Nov 9, 2010, at 3:18 PM, Peter wrote:
[...]
-- jt
James Taylor, Assistant Professor, Biology / Computer Science, Emory University
On Tue, Nov 9, 2010 at 8:37 PM, James Taylor wrote:
I think the former. Isn't that what you prefer as well? It seems like the more complete solution.
Indeed. So we'd have a choice of two columns (default, with the existing max chars option) or three columns (splitting the FASTA ">" line at the first whitespace).
I'll work on that later this week.
Peter
On Tue, Nov 9, 2010 at 8:42 PM, Peter wrote:
[...]
How about this patch? Having thought about it overnight, I've gone a step further to allow the FASTA title line to be split into any number of columns, the usefulness of which should be demonstrated by the example in the XML help.
http://bitbucket.org/peterjc/galaxy-central/changeset/181614e79ccb
This includes basic unit tests.
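Roughly, the idea of splitting the title line into any number of columns can be sketched as below. This is one possible reading, not the code in the changeset: split_title is a made-up name, and padding short titles with empty fields is an assumption so that every row ends up with the same number of columns.

```python
def split_title(title, columns):
    """Split a FASTA title into exactly `columns` fields at whitespace.

    Hypothetical helper: any text left after the last split stays in the
    final field, and short titles are padded with empty strings.
    """
    parts = title.split(None, columns - 1)
    return parts + [""] * (columns - len(parts))


print(split_title("gi|123 Homo sapiens kinase", 3))
print(split_title("seq2", 3))
```

With columns set to 1 this reduces to the existing behaviour of keeping the whole title in one field.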
Peter
On Wed, Nov 10, 2010 at 12:15 PM, Peter biopython@maubp.freeserve.co.uk wrote:
[...]
Plus this follow-up, which copes with any tabs in the FASTA title line (it converts them to spaces, as done by the Galaxy FASTQ to tabular script).
http://bitbucket.org/peterjc/galaxy-central/changeset/d2fd2defa692
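The reason for this is that a literal tab in the title would otherwise show up as an extra column in the output. A minimal sketch, with clean_title as a made-up helper name:

```python
def clean_title(title):
    """Replace tabs with spaces so the title cannot add extra columns."""
    return title.replace("\t", " ")


# Hypothetical record whose title contains a tab character.
row = "\t".join([clean_title("seq1\tsome description"), "MKTAAL"])
print(row.split("\t"))
```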
These two change sets are both on this branch:
http://bitbucket.org/peterjc/galaxy-central/src/tabular
Peter
On Wed, Nov 10, 2010 at 3:58 PM, Peter biopython@maubp.freeserve.co.uk wrote:
[...]
Hi again,
I've done a matching enhancement to the FASTQ-to-tabular tool, with a new unit test - again on the same branch.
http://bitbucket.org/peterjc/galaxy-central/changeset/52d50566f4af
http://bitbucket.org/peterjc/galaxy-central/changeset/8bf1e0b14e4b
Peter
On Wed, Nov 10, 2010 at 4:40 PM, Peter biopython@maubp.freeserve.co.uk wrote:
[...]
Hi all,
I guess you are probably busy with the planned galaxy-dist release at the moment, but could someone take a look at these proposed changes to the FASTA to tabular and FASTQ to tabular scripts?
http://bitbucket.org/peterjc/galaxy-central/src/tabular
i.e. for FASTA to tabular:
http://bitbucket.org/peterjc/galaxy-central/changeset/181614e79ccb
http://bitbucket.org/peterjc/galaxy-central/changeset/d2fd2defa692
then for FASTQ to tabular:
http://bitbucket.org/peterjc/galaxy-central/changeset/52d50566f4af
http://bitbucket.org/peterjc/galaxy-central/changeset/8bf1e0b14e4b
I'm happy to make changes if you feel anything needs changing for integration.
This enhancement (or another workaround like my proposed FASTA filter by ID script) is something I need for some in-house workflows doing classification of proteins from FASTA files.
Thanks,
Peter
On Wed, Nov 24, 2010 at 2:14 PM, Peter biopython@maubp.freeserve.co.uk wrote:
[...]
Hi again,
Can anyone take a look at these proposed changes? I don't mind being told you don't want to use them for the official repository (I'll just maintain a local branch or make them into separate tools).
Regards,
Peter
galaxy-dev@lists.galaxyproject.org