FASTA filtering by ID
Hi all,

Something I want to do in several of my workflows is to filter a FASTA file (or potentially other format sequence files) using a list of desired identifiers (e.g. a column from a tabular file).

Right now I can achieve this with three steps in Galaxy. Suppose I have:

Dataset #1, FASTA file
Dataset #2, Tabular file with identifiers of interest (e.g. BLAST hits, or filtered output from a sequence analysis tool)

Then:

Create tabular Dataset #3 using FASTA-to-tabular on Dataset #1, subject to the enhancement proposed here: http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-November/003717.html
Create tabular Dataset #4 using join on Datasets #2 and #3 using the matched identifier columns. This does the filtering.
Create FASTA Dataset #5 using tabular-to-FASTA on Dataset #4.

This works (at least for reasonably sized datasets), but requires three steps and the creation of at least two temporary files.

I'd like to introduce another tool under "FASTA manipulation" to do it in one step (rather than three). Am I going against the apparent Galaxy ideal that complex manipulations should be done with tabular files? Would such a FASTA filter tool be of interest to add directly to Galaxy (e.g. under the "FASTA manipulation" section), or would it be better off on the community Tool Shed?

Here is my current implementation for discussion/consideration:
http://bitbucket.org/peterjc/galaxy-central/changeset/730b89c4da26

Thanks,
Peter
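For readers following the thread, here is a minimal sketch of the one-step filter idea described in the message above, in plain Python. It is not the implementation linked in the changeset. The function names `read_ids` and `filter_fasta_by_id` are hypothetical, and treating the first whitespace-delimited word after `>` as the identifier is an assumption made for illustration.

```python
import io


def read_ids(tabular_handle, column=0):
    """Collect identifiers from one column of a tab-separated file."""
    ids = set()
    for line in tabular_handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if column < len(fields):
            ids.add(fields[column].strip())
    return ids


def filter_fasta_by_id(fasta_handle, wanted_ids, out_handle):
    """Copy only records whose ID is in wanted_ids; return the number kept."""
    kept = 0
    keep_record = False
    for line in fasta_handle:
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited word after '>'.
            parts = line[1:].split(None, 1)
            record_id = parts[0] if parts else ""
            keep_record = record_id in wanted_ids
            if keep_record:
                kept += 1
        if keep_record:
            out_handle.write(line)
    return kept


# Small in-memory demonstration with made-up data.
fasta = io.StringIO(">seq1 first\nACGT\nACGT\n>seq2 second\nGGGG\n>seq3\nTTTT\n")
table = io.StringIO("#id\tscore\nseq1\t50\nseq3\t42\n")
out = io.StringIO()
count = filter_fasta_by_id(fasta, read_ids(table), out)
result = out.getvalue()
```

In this sketch the records come out in the order they appear in the FASTA input, and no intermediate tabular file is written; only the set of wanted identifiers is held in memory.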
On Thu, Nov 18, 2010 at 5:46 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
[...]
I made some further updates on the branch, mainly adding unit tests:
http://bitbucket.org/peterjc/galaxy-central/src/filter_fasta

In addition to this FASTA filter by ID script (in Python using the Galaxy libraries) I've also got a FASTQ filter by ID script (also in Python using the Galaxy libraries) and an SFF filter by ID script (in Python using Biopython).

Would any of these be of interest for the main Galaxy distribution, or should I bundle them up as a tool suite for the Tool Shed instead?

Thanks,
Peter
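To illustrate how the FASTQ case differs from FASTA, here is a rough sketch of filtering FASTQ by ID in plain Python. It is not the script mentioned in the message (which uses the Galaxy libraries). The function name `filter_fastq_by_id` is hypothetical, and the sketch assumes strict four-line records with no line wrapping.

```python
import io


def filter_fastq_by_id(fastq_handle, wanted_ids, out_handle):
    """Copy four-line FASTQ records whose ID is in wanted_ids; return count kept."""
    kept = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = fastq_handle.readline()
        plus = fastq_handle.readline()
        qual = fastq_handle.readline()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("Not a four-line FASTQ record: %r" % header)
        # Assumption: the ID is the first whitespace-delimited word after '@'.
        parts = header[1:].split(None, 1)
        record_id = parts[0] if parts else ""
        if record_id in wanted_ids:
            # The quality line is copied verbatim, so its encoding is untouched.
            out_handle.write(header + seq + plus + qual)
            kept += 1
    return kept


# Small in-memory demonstration with made-up reads.
reads = io.StringIO("@r1 desc\nACGT\n+\nIIII\n@r2\nGGGG\n+\n!!!!\n")
out = io.StringIO()
count = filter_fastq_by_id(reads, {"r2"}, out)
result = out.getvalue()
```

Because the quality string is never decoded, a filter written this way does not need to know which FASTQ variant it is reading.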
Hi Peter,

Sorry for the delay, there was some discussion about this. The concept is great, but for now we have decided to leave the functionality as it is now and have users create workflows to do the same task on Galaxy main.

Using the Tool Shed is a great solution for sharing this popular function! I'll keep an eye out and approve it quickly over there.

Thanks again for all of your contributions!

Jen
Galaxy team

On 12/6/10 5:55 AM, Peter wrote:
[...]
-- Jennifer Jackson http://usegalaxy.org
On Tue, Dec 7, 2010 at 2:48 PM, Jennifer Jackson <jen@bx.psu.edu> wrote:
[...]
Thanks Jennifer,

I'll try to wrap up these filter by ID tools as a tool suite then. If Galaxy later changes its mind about putting any of this into the core, I'm fine with that - just drop me an email so I can update the tool suite as needed.

Regards,
Peter
On Tue, Dec 7, 2010 at 3:15 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
[...]
Hi Jennifer,

I've just uploaded the FASTA filter by ID script as a tool. I'll try to do the FASTQ and SFF versions shortly.

On reflection, since the SFF filter script will depend on Biopython, it seems simpler in some respects not to bundle it with the first two.

I have been having some trouble with the unit tests for the FASTA filter - I may have found a problem with Galaxy itself... I'm still looking into it.

Peter
Thanks Peter!

The tool (id = fasta_filter_by_id) has been approved in the Galaxy Tool Shed at http://community.g2.bx.psu.edu/

Please let us know if you find any errors.

Best,
Jen
Galaxy team

On 12/13/10 7:15 AM, Peter wrote:
[...]
-- Jennifer Jackson http://usegalaxy.org
On Mon, Dec 13, 2010 at 3:15 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
[...]
The SFF filter by ID script went live last month, and I've just uploaded the FASTQ filter by ID script to the Tool Shed. This was delayed since I was hoping to use a named input file's format, as discussed recently in this thread:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-December/004051.html

In the short term I have employed a workaround to accept any FASTQ variant and preserve the variant on output, based on how a similar problem is solved in tools/fastq/fastq_combiner.xml using the <change_format> tag and a list of all expected FASTQ variants.

Peter
participants (2)
- Jennifer Jackson
- Peter