Hello Jen:
Thank you for the detailed reply. I am, however, not yet convinced that everything is working as it should. To make it easier for you and others to reproduce the discrepancy, I prepared two shared Galaxy histories: http://main.g2.bx.psu.edu/u/Eckart/h/merging-with-and-without-strand-info-dm... and http://main.g2.bx.psu.edu/u/Eckart/h/merging-with-and-without-strand-info-hg...
In the first case, I obtain 110,472 Drosophila (dm3) exons annotated by FlyBase from the UCSC site (UCSC table flyBaseGene). Running merge on that data set yields 59,228 regions (data set 2 in that history). Data set 3 in that history is a copy of data set 2 with the strand column deactivated. Running merge on data set 3 yields 59,236 regions. There is a discrepancy of 8 regions that, for some reason, do not appear in the result of the merge operation when the strand information is used. These 8 regions are shown in data set 7 in that history, obtained using the subtract tool.
I re-ran that procedure using exons of human genes in the second history mentioned above (UCSC table UCSC Genes - knownGene). In that case both versions of the merge operation give exactly the same result, as they should. According to the tool documentation, the merge tool ignores strand information, so there should be no difference of the kind seen with the Drosophila exons.
Please try to reproduce these steps. I found similar differences for the coverage tool and the subtract tool, depending on whether strand information is activated or not. The fact that I only find these differences in Drosophila exons, and not in the other data sets that I have tested, makes it seem likely to me that there is an issue with the way that particular data set is internally represented.
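To help narrow down whether the representation of the data set plays a role, one could tally the values that appear in the strand column. The following is only a minimal sketch, assuming a whitespace-separated 6-column BED file with the strand in column 6; `strand_counts` is a hypothetical helper written for this email, not a Galaxy tool. The three sample records are copied from the CG32491 listing further down in this thread.

```python
from collections import Counter


def strand_counts(bed_lines, strand_column=5):
    """Count the distinct values found in the strand column (0-based index)."""
    counts = Counter()
    for line in bed_lines:
        fields = line.split()
        # Skip header lines and records that have no strand column.
        if line.startswith(("track", "browser", "#")) or len(fields) <= strand_column:
            continue
        counts[fields[strand_column]] += 1
    return counts


sample = [
    "chr3R 17184021 17184318 CG32491-RL_exon_0_0_chr3R_17184022_r 0 -",
    "chr3R 17186111 17186276 CG32491-RT_exon_0_0_chr3R_17186112_f 0 .",
    "chr3R 17187909 17188590 CG32491-RM_exon_0_0_chr3R_17187910_r 0 -",
]
counts = strand_counts(sample)
for strand, n in sorted(counts.items()):
    print(strand, n)
```

Any value other than + or - in that tally would be a candidate for closer inspection.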
Thanks,
Eckart
On Dec 1, 2010, at 1:04 PM, Jennifer Jackson wrote:
Hello Eckart,
It may be helpful to review the help for the Interval tools (includes Merge): http://bitbucket.org/galaxy/galaxy-central/wiki/GopsDesc
quote from wiki/GopsDesc help: "Merge reads a dataset, and combines all overlapping intervals into single intervals. When merging intervals, all columns besides chromosome, start, and end are lost. When two intervals are combined into one, it is ambiguous what the other columns represent or which field should be carried over to the resulting interval. For this reason, all columns except for chromosome, start and end are omitted from the output."
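As a rough illustration of the behaviour the wiki describes (this is not the actual Galaxy implementation), here is a minimal Python sketch. It assumes half-open BED-style coordinates and merges only intervals that strictly overlap; how the real tool treats intervals that merely touch is not stated in the quote. The function name `merge_intervals` is made up for this example. The three input intervals are taken from the CG32491 exon list in the original email.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) intervals.

    Only chromosome, start and end are kept; any other columns are
    dropped before calling this function, as the wiki describes.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start < current_end:
                # Strict overlap (assumption): extend the current region.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        if current_start is not None:
            merged.append((chrom, current_start, current_end))
    return merged


exons = [
    ("chr3R", 17177330, 17177608),
    ("chr3R", 17177760, 17178358),
    ("chr3R", 17178092, 17178959),
]
result = merge_intervals(exons)
print(result)
```

The second and third exons overlap and collapse into a single region, chr3R 17177760 17178959, which agrees with the second line of both merge results listed in the original email.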
The output coordinates are based on the positive strand as the default. This is the common format for BED, Interval and many other datatypes (but not all!).
Apologies for the late reply. Please see the inline comments below to your specific questions.
Best,
Jen
Galaxy team
On 11/17/10 9:28 AM, Eckart Bindewald wrote:
Hello:
Let me start by saying that I am very impressed by the service the Galaxy web server provides to the community; it has proven very useful for my work. Today I came across a situation that puzzles me. I am trying to merge exons corresponding to the same gene (but possibly from different splice variants). At the bottom of this email I am listing, as an example, the 153 exons that are related to the different splice variants of FlyBase gene CG32491. They were obtained by pattern matching (tool "Select lines that match an expression" with pattern .+CG32491-. ) applied to the data set of FlyBaseGene exons (110,472 exons, genome assembly dm3). I am using BED format and the general Galaxy web server. If I now apply the "Merge" tool to the intervals, I obtain 26 intervals (listed further below). Subtracting these merged intervals from the original 153 exons with the "subtract" tool results in 8 "leftover" regions that I did not expect. Somehow they seem to be missing in the merge result. I then deactivated the strand information in the interval set of 153 exons. Applying the merge tool now results in 34 intervals (again listed below). Checking the result via the subtract tool (subtracting the merge result from the original data set of 153 exons) results, as expected, in zero intervals.
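The consistency check I have in mind can be sketched in a few lines of Python. This is only an illustration under my own assumptions (half-open BED coordinates, and "subtract" understood as "keep the original intervals that overlap none of the merged ones"); `overlaps` and `leftover` are hypothetical helper names, not Galaxy functions. The coordinates are a small excerpt of the lists given further below.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def leftover(original, merged):
    """Original intervals that overlap none of the merged intervals."""
    return [iv for iv in original if not any(overlaps(iv, m) for m in merged)]


# Excerpt of the original exons (chrom, start, end only).
original = [
    ("chr3R", 17184011, 17184791),
    ("chr3R", 17186111, 17186276),
    ("chr3R", 17187909, 17188590),
]

# Matching excerpt of the merge result obtained with strand information.
merged_with_strand = [
    ("chr3R", 17184011, 17184791),
    ("chr3R", 17187909, 17188590),
]

# Matching excerpt of the merge result obtained without strand information.
merged_without_strand = [
    ("chr3R", 17184011, 17184791),
    ("chr3R", 17186111, 17186276),
    ("chr3R", 17187909, 17188590),
]

missing_with = leftover(original, merged_with_strand)
missing_without = leftover(original, merged_without_strand)
print(len(missing_with), len(missing_without))
```

If merge covers every input interval, the leftover list should always be empty; in this excerpt that holds only for the result obtained without strand information.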
So my questions are:
- Is this the intended functionality of the tools? Maybe one can add statements regarding these issues in the tool documentation.
Please see the wiki help link above.
- Why does the outcome of the merge operation depend on whether the "strand" column is set or not? The original set of intervals all had the same negative strand orientation, so it appears to me that the merge operation should give the same result in both cases.
If strand is not set, then the (+) strand is assumed. If strand is set, then that value will be used (in your case: (-)). In either case, the result is transformed into (+) coordinates. This is why you are getting different results.
- Subtracting the merged intervals (that do not have strand information) from the set of 153 intervals results in 8 intervals that now have positive strand orientation (they originally had negative strand orientation). Why does subtracting a set of intervals without strand information from a set of intervals with strand information change the strand orientation of the first set?
It is best to have the file types be the same or unexpected results can be produced. Hopefully the wiki can help you create a query that will produce the desired result.
Any comments are highly appreciated!
Thanks,
Eckart
Dr. Eckart Bindewald (Contractor)
SAIC-Frederick, Inc.
Center for Cancer Research Nanobiology Program
National Cancer Institute
P.O. Box B
Frederick, MD 21702 USA
Phone: 301-846-5538
Fax: 301-846-5598
E-mail: eckart@mail.nih.gov
Here is the result (34 regions) of the merge operation (not using strand orientation) applied to the 153 exon regions listed further below:
chr3R 17177330 17177608
chr3R 17177760 17178959
chr3R 17179070 17179456
chr3R 17179617 17180053
chr3R 17180159 17180416
chr3R 17180695 17181279
chr3R 17181479 17181973
chr3R 17182071 17182426
chr3R 17182532 17182690
chr3R 17182776 17183086
chr3R 17183242 17183480
chr3R 17183726 17183926
chr3R 17184011 17184791
chr3R 17186111 17186276
chr3R 17186349 17187009
chr3R 17187119 17187332
chr3R 17187391 17187860
chr3R 17187909 17188590
chr3R 17188688 17189606
chr3R 17189739 17190097
chr3R 17190173 17190367
chr3R 17190435 17190714
chr3R 17191725 17192060
chr3R 17192171 17192466
chr3R 17193631 17193960
chr3R 17194101 17194784
chr3R 17195183 17196364
chr3R 17196654 17196949
chr3R 17197044 17197789
chr3R 17197884 17198802
chr3R 17200781 17201634
chr3R 17202323 17202463
chr3R 17202540 17202798
chr3R 17203009 17203121
Here is the result (26 regions) of the merge operation (using strand orientation) applied to the 153 exon regions listed further below:
chr3R 17177330 17177608
chr3R 17177760 17178959
chr3R 17179070 17179456
chr3R 17179617 17180053
chr3R 17180159 17180416
chr3R 17180695 17181279
chr3R 17181479 17181973
chr3R 17182071 17182426
chr3R 17182532 17182690
chr3R 17182776 17183086
chr3R 17183242 17183480
chr3R 17183726 17183926
chr3R 17184011 17184791
chr3R 17187909 17188590
chr3R 17188688 17189606
chr3R 17189739 17190097
chr3R 17190173 17190367
chr3R 17190435 17190714
chr3R 17195821 17196364
chr3R 17196654 17196949
chr3R 17197044 17197789
chr3R 17197884 17198802
chr3R 17200781 17201634
chr3R 17202323 17202463
chr3R 17202540 17202798
chr3R 17203009 17203121
Here are the 8 "leftover" regions from the original 153 exons that do not intersect with any of the 26 merged regions (result of the subtract tool: those of the 153 exons that do not overlap the 26 merged regions; note the changed strand orientation):
chr3R 17186111 17186276 CG32491-RT_exon_0_0_chr3R_17186112_f 0 +
chr3R 17186349 17187009 CG32491-RT_exon_1_0_chr3R_17186350_f 0 +
chr3R 17187119 17187332 CG32491-RZ_exon_0_0_chr3R_17187120_f 0 +
chr3R 17187391 17187860 CG32491-RZ_exon_1_0_chr3R_17187392_f 0 +
chr3R 17191725 17192060 CG32491-RY_exon_0_0_chr3R_17191726_f 0 +
chr3R 17192171 17192466 CG32491-RX_exon_0_0_chr3R_17192172_f 0 +
chr3R 17193631 17193960 CG32491-RW_exon_0_0_chr3R_17193632_f 0 +
chr3R 17194101 17194784 CG32491-RV_exon_0_0_chr3R_17194102_f 0 +
Here are the 153 exons related to FlyBase gene CG32491 obtained by pattern matching (tool "Select lines that match an expression" and pattern .+CG32491-. ) applied to the data set of FlyBaseGene exons (110,472 exons):
chr3R 17177330 17177608 CG32491-RR_exon_0_0_chr3R_17177331_r 0 -
chr3R 17200781 17201634 CG32491-RR_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RR_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RR_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RR_exon_4_0_chr3R_17203010_r 0 -
chr3R 17177760 17178358 CG32491-RA_exon_0_0_chr3R_17177761_r 0 -
chr3R 17200781 17201634 CG32491-RA_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RA_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RA_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RA_exon_4_0_chr3R_17203010_r 0 -
chr3R 17178092 17178959 CG32491-RF_exon_0_0_chr3R_17178093_r 0 -
chr3R 17200781 17201634 CG32491-RF_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RF_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RF_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RF_exon_4_0_chr3R_17203010_r 0 -
chr3R 17179070 17179456 CG32491-RD_exon_0_0_chr3R_17179071_r 0 -
chr3R 17200781 17201634 CG32491-RD_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RD_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RD_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RD_exon_4_0_chr3R_17203010_r 0 -
chr3R 17179617 17180053 CG32491-RAC_exon_0_0_chr3R_17179618_r 0 -
chr3R 17200781 17201634 CG32491-RAC_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RAC_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RAC_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RAC_exon_4_0_chr3R_17203010_r 0 -
chr3R 17180159 17180416 CG32491-RG_exon_0_0_chr3R_17180160_r 0 -
chr3R 17180695 17180811 CG32491-RG_exon_1_0_chr3R_17180696_r 0 -
chr3R 17200781 17201634 CG32491-RG_exon_2_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RG_exon_3_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RG_exon_4_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RG_exon_5_0_chr3R_17203010_r 0 -
chr3R 17180159 17180416 CG32491-RH_exon_0_0_chr3R_17180160_r 0 -
chr3R 17180695 17181279 CG32491-RH_exon_1_0_chr3R_17180696_r 0 -
chr3R 17200781 17201634 CG32491-RH_exon_2_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RH_exon_3_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RH_exon_4_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RH_exon_5_0_chr3R_17203010_r 0 -
chr3R 17180159 17180416 CG32491-RQ_exon_0_0_chr3R_17180160_r 0 -
chr3R 17200781 17201634 CG32491-RQ_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RQ_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RQ_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RQ_exon_4_0_chr3R_17203010_r 0 -
chr3R 17180941 17181279 CG32491-RB_exon_0_0_chr3R_17180942_r 0 -
chr3R 17181479 17181973 CG32491-RB_exon_1_0_chr3R_17181480_r 0 -
chr3R 17200781 17201634 CG32491-RB_exon_2_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RB_exon_3_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RB_exon_4_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RB_exon_5_0_chr3R_17203010_r 0 -
chr3R 17182071 17182426 CG32491-RI_exon_0_0_chr3R_17182072_r 0 -
chr3R 17182532 17182690 CG32491-RI_exon_1_0_chr3R_17182533_r 0 -
chr3R 17200781 17201634 CG32491-RI_exon_2_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RI_exon_3_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RI_exon_4_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RI_exon_5_0_chr3R_17203010_r 0 -
chr3R 17182776 17183086 CG32491-RJ_exon_0_0_chr3R_17182777_r 0 -
chr3R 17200781 17201634 CG32491-RJ_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RJ_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RJ_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RJ_exon_4_0_chr3R_17203010_r 0 -
chr3R 17183242 17183480 CG32491-RP_exon_0_0_chr3R_17183243_r 0 -
chr3R 17183726 17183926 CG32491-RP_exon_1_0_chr3R_17183727_r 0 -
chr3R 17200781 17201634 CG32491-RP_exon_2_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RP_exon_3_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RP_exon_4_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RP_exon_5_0_chr3R_17203010_r 0 -
chr3R 17184011 17184791 CG32491-RK_exon_0_0_chr3R_17184012_r 0 -
chr3R 17200781 17201634 CG32491-RK_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RK_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RK_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RK_exon_4_0_chr3R_17203010_r 0 -
chr3R 17184021 17184318 CG32491-RL_exon_0_0_chr3R_17184022_r 0 -
chr3R 17200781 17201634 CG32491-RL_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RL_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RL_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RL_exon_4_0_chr3R_17203010_r 0 -
chr3R 17186111 17186276 CG32491-RT_exon_0_0_chr3R_17186112_f 0 .
chr3R 17186349 17187009 CG32491-RT_exon_1_0_chr3R_17186350_f 0 .
chr3R 17200781 17201634 CG32491-RT_exon_2_0_chr3R_17200782_f 0 .
chr3R 17202323 17202463 CG32491-RT_exon_3_0_chr3R_17202324_f 0 .
chr3R 17202540 17202798 CG32491-RT_exon_4_0_chr3R_17202541_f 0 .
chr3R 17203009 17203121 CG32491-RT_exon_5_0_chr3R_17203010_f 0 .
chr3R 17187119 17187332 CG32491-RZ_exon_0_0_chr3R_17187120_f 0 .
chr3R 17187391 17187860 CG32491-RZ_exon_1_0_chr3R_17187392_f 0 .
chr3R 17200781 17201634 CG32491-RZ_exon_2_0_chr3R_17200782_f 0 .
chr3R 17202323 17202463 CG32491-RZ_exon_3_0_chr3R_17202324_f 0 .
chr3R 17202540 17202798 CG32491-RZ_exon_4_0_chr3R_17202541_f 0 .
chr3R 17203009 17203121 CG32491-RZ_exon_5_0_chr3R_17203010_f 0 .
chr3R 17187909 17188590 CG32491-RM_exon_0_0_chr3R_17187910_r 0 -
chr3R 17200781 17201634 CG32491-RM_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RM_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RM_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RM_exon_4_0_chr3R_17203010_r 0 -
chr3R 17188688 17189606 CG32491-RE_exon_0_0_chr3R_17188689_r 0 -
chr3R 17200781 17201634 CG32491-RE_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RE_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RE_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RE_exon_4_0_chr3R_17203010_r 0 -
chr3R 17189739 17190097 CG32491-RAB_exon_0_0_chr3R_17189740_r 0 -
chr3R 17200781 17201634 CG32491-RAB_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RAB_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RAB_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RAB_exon_4_0_chr3R_17203010_r 0 -
chr3R 17190173 17190367 CG32491-RC_exon_0_0_chr3R_17190174_r 0 -
chr3R 17190435 17190714 CG32491-RC_exon_1_0_chr3R_17190436_r 0 -
chr3R 17200781 17201634 CG32491-RC_exon_2_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RC_exon_3_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RC_exon_4_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RC_exon_5_0_chr3R_17203010_r 0 -
chr3R 17191725 17192060 CG32491-RY_exon_0_0_chr3R_17191726_f 0 .
chr3R 17200781 17201634 CG32491-RY_exon_1_0_chr3R_17200782_f 0 .
chr3R 17202323 17202463 CG32491-RY_exon_2_0_chr3R_17202324_f 0 .
chr3R 17202540 17202798 CG32491-RY_exon_3_0_chr3R_17202541_f 0 .
chr3R 17203009 17203121 CG32491-RY_exon_4_0_chr3R_17203010_f 0 .
chr3R 17192171 17192466 CG32491-RX_exon_0_0_chr3R_17192172_f 0 .
chr3R 17200781 17201634 CG32491-RX_exon_1_0_chr3R_17200782_f 0 .
chr3R 17202323 17202463 CG32491-RX_exon_2_0_chr3R_17202324_f 0 .
chr3R 17202540 17202798 CG32491-RX_exon_3_0_chr3R_17202541_f 0 .
chr3R 17203009 17203121 CG32491-RX_exon_4_0_chr3R_17203010_f 0 .
chr3R 17193631 17193960 CG32491-RW_exon_0_0_chr3R_17193632_f 0 .
chr3R 17200781 17201634 CG32491-RW_exon_1_0_chr3R_17200782_f 0 .
chr3R 17202323 17202463 CG32491-RW_exon_2_0_chr3R_17202324_f 0 .
chr3R 17202540 17202798 CG32491-RW_exon_3_0_chr3R_17202541_f 0 .
chr3R 17203009 17203121 CG32491-RW_exon_4_0_chr3R_17203010_f 0 .
chr3R 17194101 17194784 CG32491-RV_exon_0_0_chr3R_17194102_f 0 .
chr3R 17200781 17201634 CG32491-RV_exon_1_0_chr3R_17200782_f 0 .
chr3R 17202323 17202463 CG32491-RV_exon_2_0_chr3R_17202324_f 0 .
chr3R 17202540 17202798 CG32491-RV_exon_3_0_chr3R_17202541_f 0 .
chr3R 17203009 17203121 CG32491-RV_exon_4_0_chr3R_17203010_f 0 .
chr3R 17195183 17195967 CG32491-RU_exon_0_0_chr3R_17195184_f 0 .
chr3R 17200781 17201634 CG32491-RU_exon_1_0_chr3R_17200782_f 0 .
chr3R 17202323 17202463 CG32491-RU_exon_2_0_chr3R_17202324_f 0 .
chr3R 17202540 17202798 CG32491-RU_exon_3_0_chr3R_17202541_f 0 .
chr3R 17203009 17203121 CG32491-RU_exon_4_0_chr3R_17203010_f 0 .
chr3R 17195821 17196364 CG32491-RS_exon_0_0_chr3R_17195822_r 0 -
chr3R 17200781 17201634 CG32491-RS_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RS_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RS_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RS_exon_4_0_chr3R_17203010_r 0 -
chr3R 17196654 17196949 CG32491-RAA_exon_0_0_chr3R_17196655_r 0 -
chr3R 17200781 17201634 CG32491-RAA_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RAA_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RAA_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RAA_exon_4_0_chr3R_17203010_r 0 -
chr3R 17197044 17197789 CG32491-RO_exon_0_0_chr3R_17197045_r 0 -
chr3R 17200781 17201634 CG32491-RO_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RO_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RO_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RO_exon_4_0_chr3R_17203010_r 0 -
chr3R 17197884 17198802 CG32491-RN_exon_0_0_chr3R_17197885_r 0 -
chr3R 17200781 17201634 CG32491-RN_exon_1_0_chr3R_17200782_r 0 -
chr3R 17202323 17202463 CG32491-RN_exon_2_0_chr3R_17202324_r 0 -
chr3R 17202540 17202798 CG32491-RN_exon_3_0_chr3R_17202541_r 0 -
chr3R 17203009 17203121 CG32491-RN_exon_4_0_chr3R_17203010_r 0 -
galaxy-user mailing list galaxy-user@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-user
-- Jennifer Jackson http://usegalaxy.org