Extracting regions of sequences given table of co-ords

Peter Cock

3 Aug 2011

3:56 p.m.

Hi all,

I'm about to start working on a general tool for extracting sequences (from FASTA etc) based on a tabular file describing desired regions (this could be a BED or GFF3 file), giving a new sequence file with one entry for each region in the tabular file. Essentially I want as input columns describing:

* FASTA sequence ID
* Start
* End
* Strand (optional)
* Region ID for this sequence in the output file (optional)

In the case of a GFF3 input, those would be columns 1, 4, 5, 7 (with the ID being hidden inside column 9). But this isn't really driven by GFF3 where you'd probably not want to treat each line separately anyway (compound features).

I'm thinking of very general examples like extracting general search matches (BLAST, HMMER, motifs, etc), but specifically things like cleaving proteins to remove a predicted signal peptide. This could even be used for trimming of sequencing reads (for small datasets, e.g. capillary reads).

Is there anything like this in Galaxy already? If so I can't find it ;)

I've also skimmed the tool shed.

Regards,

Peter
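For illustration only, here is a minimal sketch of the kind of tool described in this message. It is not the tool that was eventually written: the function names, the tab-separated column order (sequence ID, start, end, optional strand, optional region ID) and the choice of 1-based inclusive coordinates are all assumptions made for the example.

```python
# Hypothetical sketch: names and the table layout are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_fasta(text):
    """Return {sequence ID: sequence}; the ID is the first word of the header."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif line and current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def extract_regions(fasta_text, table_text):
    """Yield (region ID, sequence) for each row of the table.

    Assumed columns: sequence ID, start, end, then optional strand and region ID.
    Coordinates are assumed to be 1-based and inclusive, as in GFF3.
    """
    sequences = parse_fasta(fasta_text)
    for row in table_text.splitlines():
        if not row.strip() or row.startswith("#"):
            continue
        fields = row.split("\t")
        seq_id, start, end = fields[0], int(fields[1]), int(fields[2])
        strand = fields[3] if len(fields) > 3 and fields[3] else "+"
        if len(fields) > 4 and fields[4]:
            region_id = fields[4]
        else:
            region_id = f"{seq_id}_{start}_{end}"
        region = sequences[seq_id][start - 1:end]
        if strand == "-":
            # Reverse complement only makes sense for nucleotide sequences.
            region = region.translate(COMPLEMENT)[::-1]
        yield region_id, region


fasta = ">chr1 demo\nACGTACGTAA\n>prot1\nMKTLLVAGSA\n"
table = "chr1\t2\t5\t-\thit1\nprot1\t4\t10\n"
for name, seq in extract_regions(fasta, table):
    print(f">{name}\n{seq}")
```

Each table row produces exactly one output record, which matches the "one entry for each region" behaviour asked for above; compound GFF3 features would need extra handling that this sketch does not attempt.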

Peter Cock

4 Aug 2011

3:17 p.m.

It turns out there sort of is: extract/extract_genomic_dna.xml

However, as written it isn't clear to me how to make this work with general tabular files, or how it would behave with proteins (which in principle can be processed the same way).

It also appears to insist on having both start and end present (I want to be able to have these default to the start/end of the referenced parent sequence).

Peter
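The defaulting behaviour wished for here could look something like the following sketch. It says nothing about how extract_genomic_dna.xml works; `resolve_bounds` and `cut` are hypothetical helper names, and blank table fields are assumed to mean "use the end of the parent sequence". Coordinates are again assumed to be 1-based and inclusive.

```python
def resolve_bounds(sequence, start_field="", end_field=""):
    """Return 1-based inclusive (start, end), filling blanks from the parent sequence."""
    start = int(start_field) if start_field.strip() else 1
    end = int(end_field) if end_field.strip() else len(sequence)
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"region {start}..{end} is outside 1..{len(sequence)}")
    return start, end


def cut(sequence, start_field="", end_field=""):
    """Return the requested slice; works the same for protein and nucleotide strings."""
    start, end = resolve_bounds(sequence, start_field, end_field)
    return sequence[start - 1:end]


protein = "MKKTAIAIAVALAGFATVAQAAPKDNTWY"  # illustrative example sequence
# Remove a predicted 21-residue signal peptide: start given, end left blank.
mature = cut(protein, "22", "")
print(mature)
```

Because the slicing is plain string indexing, nothing in it is specific to DNA, which is the point made above about proteins being processed the same way.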

Peter Cock

11 Aug 2011

11:23 a.m.

Any thoughts from the Galaxy team? Are you open to making extract/extract_genomic_dna.xml more general - for instance renaming and rewording to indicate work on protein FASTA files as well as nucleotide FASTA files and reference genomes (assuming it does work)?

Peter

Jeremy Goecks

1:35 p.m.

Peter,

We're open to enhancing this tool. Here are my thoughts:

* Enabling the tool to work with general tabular formats would seem to be a relatively low priority because it's easy to convert a general tabular file into a simple BED file by cutting columns. If you plan to support a general tabular format, the GOPS tools provide examples of how to work with BED, GFF, and interval datasets.

* The tool should work with general fasta files right now, so the changes needed to support protein and other fastas should be UI-based only.

Best,

J.
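As a rough sketch of the "cutting columns" route suggested here: the function name, the example hit table and its column positions are assumptions for illustration, not part of any Galaxy tool. One detail worth keeping in mind is that BED starts are 0-based with exclusive ends, so a table using 1-based inclusive coordinates needs its start column shifted by one.

```python
def table_to_bed(lines, id_col=0, start_col=1, end_col=2, one_based=True):
    """Yield three-column BED lines from tab-separated rows.

    Columns are picked by 0-based index. Set one_based=False if the input
    already uses BED-style coordinates.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        start, end = int(fields[start_col]), int(fields[end_col])
        if one_based:
            # BED starts are 0-based and ends are exclusive, so only the start shifts.
            start -= 1
        yield f"{fields[id_col]}\t{start}\t{end}"


# Hypothetical search-hit table: query, subject, identity, subject start, subject end.
hits = ["q1\tchr1\t98.5\t101\t250", "q2\tchr2\t91.0\t7\t42"]
bed = list(table_to_bed(hits, id_col=1, start_col=3, end_col=4))
print("\n".join(bed))
```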

Peter Cock

2:55 p.m.

Jeremy wrote:

> Peter wrote:
>
>> Any thoughts from the Galaxy team? Are you open to making extract/extract_genomic_dna.xml more general - for instance renaming and rewording to indicate work on protein FASTA files as well as nucleotide FASTA files and reference genomes (assuming it does work)?
>
> Peter,
>
> We're open to enhancing this tool. Here are my thoughts:
>
> * Enabling the tool to work with general tabular formats would seem to be a relatively low priority because it's easy to convert a general tabular file into a simple BED file by cutting columns.

For some of the example use cases I have in mind, I'd first have to generate a start column of ones (beginning of the sequence), and an end column of the length of the sequence. But yes, I can see how this could be done.
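A minimal sketch of that preparatory step, assuming a plain FASTA file and a hypothetical helper name `full_length_rows`: it emits one row per record with a start of 1 and an end equal to the sequence length, ready to be edited or joined with other columns.

```python
def full_length_rows(fasta_lines):
    """Yield 'ID<TAB>1<TAB>length' for every record in a FASTA file."""
    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield f"{name}\t1\t{length}"
            # The ID is taken to be the first word of the header line.
            name, length = line[1:].split()[0], 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield f"{name}\t1\t{length}"


fasta = [">read1 capillary", "ACGTAC", "GTA", ">read2", "TTGACA"]
rows = list(full_length_rows(fasta))
print("\n".join(rows))
```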

> If you plan to support a general tabular format, the GOPS tools provide examples of how to work with BED, GFF, and interval datasets.

Putting GOPS into the Galaxy tool search brings up the Operate on Genomic Intervals section, is GOPS short for genomic interval operations?

> * The tool should work with general fasta files right now, so the changes needed to support protein and other fastas should be UI-based only.

Right, while of course preserving the (now misleading) tool ID of "Extract genomic DNA 1" and filename of extract/extract_genomic_dna.xml for backwards compatibility.

I'll take a look at this then - thanks,

Peter

Jeremy Goecks

3:11 p.m.

> Putting GOPS into the Galaxy tool search brings up the Operate on Genomic Intervals section, is GOPS short for genomic interval operations?

Correct.

>> * The tool should work with general fasta files right now, so the changes needed to support protein and other fastas should be UI-based only.
>
> Right, while of course preserving the (now misleading) tool ID of "Extract genomic DNA 1" and filename of extract/extract_genomic_dna.xml for backwards compatibility.

Only the tool ID is required for backwards compatibility; the filename can be changed.

J.

Peter Cock

12 Aug 2011

9:36 a.m.

OK.

First, some minor housekeeping - if you can clarify the text for how a user should go from a tabular file to an interval or GFF file, that would be good:

https://bitbucket.org/peterjc/galaxy-central/changeset/d41ee8c41532

Could you transplant/merge that please? It's the only commit so far on my new branch:

https://bitbucket.org/peterjc/galaxy-central/src/extract_region

Regards,

Peter

Kanwei Li

6:38 p.m.

Done

Peter Cock

7:51 p.m.

Thanks!

Peter
