Changing the FASTA to tabular converter

Peter

9 Nov 2010

4:34 p.m.

Hello all,

I ran into a problem in a work flow manipulating FASTA and tabular files. I traced this to an unexpected behaviour of the FASTA to tabular converter.

In my experience most command line tools which take FASTA files as input treat the first word after the ">" as the identifier for each FASTA record, and any subsequent text as an optional description. It could then make sense to turn a FASTA file into a three column tabular file (identifier, description, sequence). Currently Galaxy does not make this distinction, so we have just two columns (identifier+description, seq).

Would you all be amenable to my extending this script to allow the user to choose between 2 column output (current behaviour) and 3 column output (splitting the FASTA ">" line at the first white space)?

Alternatively, I have written a less invasive patch to allow an easy way to extract the identifier (first word) and sequence:

http://bitbucket.org/peterjc/galaxy-central/changeset/f57552b4f9fb

Note that currently the converter does allow the ">" line to be trimmed which can achieve the same goal but ONLY when all the identifiers are the same length (rarely the case in my experience).

Similarly, I'd like to extend the tabular to FASTA converter to allow a third column to be selected as the description, giving for example ">c1 c3" as the ">" line, with the sequence coming from c2.

I look forward to comments,

Thanks,

Peter

P.S. All these comments apply equally to the FASTQ to/from tabular converters.
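To make the proposal concrete, here is a minimal Python sketch of the three-column idea. It is not the Galaxy script or the linked patch: the names `parse_fasta`, `fasta_to_three_columns` and `tabular_to_fasta`, and the 0-based `id_col`, `seq_col` and `desc_col` parameters, are hypothetical and chosen only for illustration. It splits each ">" line at the first run of white space, and also shows the reverse direction where a chosen column supplies the description.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs; the title is the '>' line minus the '>'."""
    title: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            # Sequence lines may be wrapped; join them into one string.
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


def fasta_to_three_columns(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'identifier<TAB>description<TAB>sequence' rows.

    The title is split at the first run of white space. Tabs inside the
    description are NOT handled here (an assumption of this sketch).
    """
    for title, seq in parse_fasta(lines):
        parts = title.split(None, 1)
        identifier = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""
        yield "\t".join([identifier, description, seq])


def tabular_to_fasta(
    rows: Iterable[str],
    id_col: int = 0,
    seq_col: int = 1,
    desc_col: Optional[int] = None,
) -> Iterator[str]:
    """Yield FASTA records; desc_col (if given) is appended to the title."""
    for row in rows:
        fields = row.rstrip("\r\n").split("\t")
        title = fields[id_col]
        if desc_col is not None and fields[desc_col]:
            title += " " + fields[desc_col]
        yield ">%s\n%s" % (title, fields[seq_col])
```

With `id_col=0, seq_col=1, desc_col=2` the reverse function produces the ">c1 c3" style title line described in the message. Fixed-width trimming cannot reproduce the split when identifiers differ in length: keeping the first four characters of `seq1 first protein` gives `seq1`, but the same setting applied to `longer_id` gives `long`.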

James Taylor

9 Nov

7:24 p.m.

Hi Peter,

as long as the user can choose, and the previous functionality remains the default (for backward compatibility), I'm all for this.

_______________________________________________
galaxy-dev mailing list
galaxy-dev@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-dev

-- jt James Taylor, Assistant Professor, Biology / Computer Science, Emory University

Peter

8:18 p.m.

On Tue, Nov 9, 2010 at 7:24 PM, James Taylor wrote:

Hi Peter, as long as the user can choose, and the previous functionality remains the default (for backward compatibility), I'm all for this.

Cool - but which of my proposals did you like (or both)? The option of three columns (id, description, sequence) or stick with two columns but give more flexibility for handling the id/description?

Peter

P.S. I'd have to tweak the changeset to make the old mode the default.

James Taylor

8:37 p.m.

I think the former. Isn't that what you prefer as well? It seems like the more complete solution.

-- jt James Taylor, Assistant Professor, Biology / Computer Science, Emory University

Peter

8:42 p.m.

On Tue, Nov 9, 2010 at 8:37 PM, James Taylor wrote:

I think the former. Isn't that what you prefer as well? It seems like the more complete solution.

Indeed. So we'd have a choice of two columns (default, with the existing max chars option) or three columns (splits the FASTA ">" line at the first white space).

I'll work on that later this week.

Peter

Peter

10 Nov

12:15 p.m.

On Tue, Nov 9, 2010 at 8:42 PM, Peter wrote:

Indeed. So we'd have a choice of two columns (default, with the existing max chars option) or three columns (splits the FASTA ">" line at the first white space).

I'll work on that later this week.

How about this patch? Having thought about it overnight, I've gone a step further to allow the FASTA title line to be split into any number of columns, the usefulness of which should be demonstrated by the example in the XML help.

http://bitbucket.org/peterjc/galaxy-central/changeset/181614e79ccb

This includes basic unit tests.

Peter
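For readers who cannot open the changeset, splitting a title line into any number of columns could look roughly like the following Python sketch. It is an illustration only and does not reproduce the patch: the function name `split_title` and its `columns` argument are hypothetical, and padding short titles with empty strings (so every row has the same width) is an assumption of this sketch.

```python
from typing import List


def split_title(title: str, columns: int) -> List[str]:
    """Split a FASTA title into exactly `columns` fields at white space.

    The last field keeps any remaining text, including its internal spaces.
    Titles with too few words are padded with empty strings.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    if columns == 1:
        # Backward-compatible mode: the whole title stays in one column.
        return [title]
    fields = title.split(None, columns - 1)
    fields += [""] * (columns - len(fields))
    return fields
```

With `columns=1` the whole title stays in a single field, matching the existing two-column output once the sequence is appended; `columns=2` gives the identifier/description split discussed earlier in the thread.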

Peter

3:58 p.m.

On Wed, Nov 10, 2010 at 12:15 PM, Peter <biopython@maubp.freeserve.co.uk> wrote:

How about this patch? Having thought about it overnight, I've gone a step further to allow the FASTA title line to be split into any number of columns, the usefulness of which should be demonstrated by the example in the XML help.

http://bitbucket.org/peterjc/galaxy-central/changeset/181614e79ccb

This includes basic unit tests.

Plus this follow up which copes with any tabs in the FASTA title line (converts them to spaces, as done by the Galaxy FASTQ to tabular script).

http://bitbucket.org/peterjc/galaxy-central/changeset/d2fd2defa692

These two change sets are both on this branch:

http://bitbucket.org/peterjc/galaxy-central/src/tabular

Peter
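The reason tabs matter is that a literal tab inside a title would be read as an extra column separator in the tabular output. A small Python sketch of the conversion, with the hypothetical helper names `sanitize_title` and `title_and_sequence_row` (not taken from the changeset):

```python
def sanitize_title(title: str) -> str:
    """Replace each tab in a FASTA title with a single space."""
    return title.replace("\t", " ")


def title_and_sequence_row(title: str, sequence: str) -> str:
    """Build a two-column tabular row: sanitized title, then sequence."""
    return sanitize_title(title) + "\t" + sequence


# Without sanitizing, the tab in the title creates a spurious third column.
raw_row = "id\tdesc with tab" + "\t" + "ACGT"
clean_row = title_and_sequence_row("id\tdesc with tab", "ACGT")
```

Here `raw_row` splits into three tab-separated fields while `clean_row` splits into the intended two.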

Peter

4:40 p.m.

Hi again,

I've done a matching enhancement to the FASTQ-to-tabular tool, with a new unit test - again on the same branch.

http://bitbucket.org/peterjc/galaxy-central/changeset/52d50566f4af
http://bitbucket.org/peterjc/galaxy-central/changeset/8bf1e0b14e4b

Peter
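A rough Python sketch of what the matching FASTQ behaviour might look like follows. It is a guess at the shape of the feature, not the changeset itself: the names `parse_fastq`, `fastq_to_tabular` and `title_columns` are hypothetical, and the parser assumes simple four-line FASTQ records with no line wrapping and no blank lines.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (title, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            return
        if (
            len(record) < 4
            or not record[0].startswith("@")
            or not record[2].startswith("+")
        ):
            raise ValueError("malformed FASTQ record: %r" % (record,))
        yield record[0][1:], record[1], record[3]


def fastq_to_tabular(lines: Iterable[str], title_columns: int = 1) -> Iterator[str]:
    """Yield tab-separated rows: title field(s), sequence, quality.

    title_columns=1 keeps the whole title in one column; larger values
    split the title at white space and pad with empty strings.
    """
    if title_columns < 1:
        raise ValueError("title_columns must be at least 1")
    for title, seq, qual in parse_fastq(lines):
        title = title.replace("\t", " ")  # tabs would add spurious columns
        if title_columns == 1:
            fields = [title]
        else:
            fields = title.split(None, title_columns - 1)
            fields += [""] * (title_columns - len(fields))
        yield "\t".join(fields + [seq, qual])
```

Defaulting `title_columns` to 1 keeps the single-title-column layout as the default, in line with the backward-compatibility request earlier in the thread.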

Peter

24 Nov

2:14 p.m.

Hi all,

I guess you are probably busy with the planned galaxy-dist release at the moment, but could someone take a look at these proposed changes to the FASTA to tabular and FASTQ to tabular scripts?

http://bitbucket.org/peterjc/galaxy-central/src/tabular

i.e. For FASTA to tabular:

http://bitbucket.org/peterjc/galaxy-central/changeset/181614e79ccb
http://bitbucket.org/peterjc/galaxy-central/changeset/d2fd2defa692

then for FASTQ to tabular:

http://bitbucket.org/peterjc/galaxy-central/changeset/52d50566f4af
http://bitbucket.org/peterjc/galaxy-central/changeset/8bf1e0b14e4b

I'm happy to make changes if you feel anything needs changing for integration.

This enhancement (or another workaround like my proposed FASTA filter by ID script) is something I need for some in-house workflows doing classification of proteins from FASTA files.

Thanks,

Peter

Peter

6 Dec

1:46 p.m.

Hi again,

Can anyone take a look at these proposed changes? I don't mind being told you don't want to use them for the official repository (I'll just maintain a local branch or make them into separate tools).

Regards,

Peter
