Extract Genomic DNA: strand information is not recognized
Hello,
I am trying to extract sequences from a FASTA file containing genomic information. The coordinates are in a tab-delimited format, which Galaxy recognizes as BED format (meaning that the 6th column is correctly interpreted as 6. Strand).
However, upon running "Fetch Sequences" > "Extract Genomic DNA", only the + strand information is included in the output FASTA file, and I receive the following error message:
1,431 sequences format: fasta, database: ? Info: 1476 warnings, 1st is: Invalid interval, start '1616' > end '1177'. Skipped 1476 invalid lines, 1st is #2, "scaffold00001 1616 1177 Fom - 1"
Is this a bug? How can I adjust my input data files to get the - strand sequences as well?
I have seen a similar problem in an earlier posting, where it was suggested to manually adjust the strand information in column 5, but this did not work for me either.
Many thanks for all your help!
Sarah
Hi Sarah,
One of the specifications of BED format is that the coordinates are given with respect to the forward strand, so the start must not be greater than the end, even for features on the reverse strand. BED format originated at UCSC, and this is their full specification: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
Galaxy's summary (also shown on tool forms that accept BED format): http://galaxyproject.org/wiki/Learn/Datatypes#Bed
The rules for transforming data in other coordinate formats to BED are explained in detail in this UCSC wiki document: http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
There are no Galaxy-wrapped automated tools to do this transformation, but perhaps someone on the mailing list has a workflow to offer. If not, the tools in Galaxy under "Text Manipulation" and "Filter and Sort", together with a file containing the length of each chromosome, can very likely be combined to perform the calculations (in several steps). If you create a process to do this, be sure to consider publishing the workflow for others to use.
Hopefully this helps,
Best,
Jen
Galaxy team
On 10/24/11 7:20 PM, Sarah wrote:
Hello,
I am trying to extract sequences from a FASTA file containing genomic information. The coordinates are in a tab-delimited format, which is recognized as BED format by Galaxy (meaning that the 6th column is correctly interpreted as 6. Strand).
However, upon running "Fetch Sequences" > "Extract Genomic DNA", only the + strand information is included in the output FASTA file, and I receive the following error message:
1,431 sequences format: fasta, database: ? Info: 1476 warnings, 1st is: Invalid interval, start '1616' > end '1177'. Skipped 1476 invalid lines, 1st is #2, "scaffold00001 1616 1177 Fom - 1"
Is this a bug? How can I adjust my input data files to get the - strand sequences as well?
I have seen a similar problem in an earlier posting, where it was suggested to manually adjust the strand information in column 5, but this did not work for me either.
Many thanks for all your help!
Sarah
-- Jennifer Jackson http://usegalaxy.org http://galaxyproject.org/wiki/Support
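For readers who would rather do the transformation outside Galaxy, the following is a minimal Python sketch of the idea described in the reply, not an official Galaxy or UCSC tool. It assumes that a row with start > end is a reverse-strand feature whose two coordinates were simply written in the opposite order, and that the first three tab-separated columns are chrom, start, end. The function names and the `one_based` option are hypothetical and exist only for this illustration. Whether the input is 1-based is not stated in the thread, so that conversion is off by default. The second function covers the other case from the UCSC wiki, where coordinates are counted from the end of the reverse strand and the chromosome length is needed.

```python
def normalize_line(line, one_based=False):
    """Return a BED-like line with start <= end.

    Assumption: a row with start > end only has its two coordinates
    swapped. Lines without two numeric coordinates (headers, comments)
    are returned unchanged.
    """
    text = line.rstrip("\n")
    fields = text.split("\t")
    if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
        return text
    start, end = int(fields[1]), int(fields[2])
    if start > end:
        start, end = end, start
    if one_based:
        # 1-based inclusive [a, b] becomes 0-based half-open [a-1, b)
        start -= 1
    fields[1], fields[2] = str(start), str(end)
    return "\t".join(fields)


def from_reverse_strand(start, end, chrom_size):
    """Convert 0-based half-open coordinates counted along the reverse
    strand into forward-strand coordinates, given the chromosome length."""
    return chrom_size - end, chrom_size - start


if __name__ == "__main__":
    row = "scaffold00001\t1616\t1177\tFom\t-\t1"
    print(normalize_line(row))
```

Applied to the row from the warning message, this swaps 1616 and 1177 and leaves the other columns as they are. Note that in that row the "-" sits in the fifth column, while BED puts the score in column 5 and the strand in column 6, so the column order may also need checking before the minus-strand sequences are extracted as expected.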