Re: [galaxy-user] operate on genomic intervals

9 May 2012

Hi Jose,

That's great news! A tool tip in the wiki or UI would probably be
helpful - your question was a good one. Meanwhile, I'll post your
results back to the list; they may help others who are also working to
optimize.

Glad that it worked out so well,

Jen
Galaxy team

On 5/8/12 2:57 PM, Xianrong Wong wrote:
...
Thank you Jennifer!  I swapped the order of the input datasets which
reduced the running time from 2 hrs to 10 min!
Jose
On Mon, May 7, 2012 at 12:03 PM, Jennifer Jackson <jen@bx.psu.edu> wrote:
Hi Jose,
Very glad to know that you have this working.
Your question is difficult to address with specificity, as these are
completely different algorithms. But in general, any alignment
algorithm (Bowtie included) has some sort of indexing strategy (some
are better than others) to minimize what is held in memory while
bulk data is processed through it. See the Bowtie documentation for
how this is achieved.
The interval operations tools also have an indexing strategy:
specifically, the second input file is the portion loaded into memory,
and the first file is processed against it. So, if you want to use an
extremely large dataset (or just want the job to run quicker), try
to use it as the first input file if possible.
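To illustrate why the input order matters, here is a minimal Python sketch of the pattern described above: the second dataset is indexed in memory and the first is streamed past it one record at a time. This is not the Galaxy tool's actual code; the function names (index_intervals, stream_join) and the half-open (chrom, start, end) tuples are assumptions made for the example.

```python
from collections import defaultdict


def index_intervals(intervals):
    """Group (chrom, start, end) tuples by chromosome, sorted by start.

    This is the part that is held in memory in full.
    """
    index = defaultdict(list)
    for chrom, start, end in intervals:
        index[chrom].append((start, end))
    for spans in index.values():
        spans.sort()
    return index


def stream_join(first, second):
    """Yield (a, b) pairs where a record from `first` overlaps one from `second`.

    Coordinates are treated as half-open: [start, end).
    """
    index = index_intervals(second)      # second input: loaded completely
    for chrom, start, end in first:      # first input: one record at a time
        for b_start, b_end in index.get(chrom, ()):
            if b_start >= end:
                break                    # sorted by start: nothing later overlaps
            if b_end > start:
                yield (chrom, start, end), (chrom, b_start, b_end)
```

Memory use here grows with the size of `second` only, so passing the very large dataset as `first` keeps the in-memory index small.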
These tools are designed to be used together and with other tools to
create workflows, so there should pretty much always be some way to
break jobs up (as you did) to get them through the tools, even on
modest systems:
http://wiki.g2.bx.psu.edu/Learn/Interval%20Operations
Take care,
Jen
Galaxy team
On 5/7/12 8:30 AM, Xianrong Wong wrote:
Hello Jennifer, thanks for the advice. It worked when I did it a
chromosome at a time. Is there a reason why this is so much more
computationally heavy as compared to bowtie? (Mapping 190 million
reads took only 3-4 hours for me.)
Jose
On Fri, May 4, 2012 at 7:55 PM, Jennifer Jackson <jen@bx.psu.edu> wrote:
Hello Jose,
It sounds as if the job is running out of memory. Since you are
already working on a cloud, I am going to make the assumption that
you have explored the server options with high-capacity memory
there. But if not, that is one place to start, in particular your
EC2 Instance type, as described on this wiki:
http://wiki.g2.bx.psu.edu/Admin/Cloud/CapacityPlanning
However, even if that was an option, you may want to consider
running your data through in another way: by running smaller
jobs, then merging results, to avoid the large jobs. For example, in
the last step where you join to the "full BamHI delimited bin file",
instead join to groups of bins in that file (perhaps grouped by
chromosome), then combine the results to produce the full output.
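As a rough sketch of the "smaller jobs, then merge" idea, the Python below splits both datasets by chromosome, runs a join on each group, and concatenates the results. The helper names (split_by_chrom, join_in_batches, naive_join) are made up for this example, and naive_join is only a stand-in for whatever join step is actually used; in Galaxy itself the same thing would be done with filter, join, and concatenate tools in a workflow.

```python
from collections import defaultdict


def split_by_chrom(intervals):
    """Group (chrom, start, end) records by chromosome."""
    groups = defaultdict(list)
    for record in intervals:
        groups[record[0]].append(record)
    return groups


def naive_join(reads, bins):
    """Stand-in join: all (read, bin) pairs that overlap (half-open coordinates)."""
    return [
        (r, b)
        for r in reads
        for b in bins
        if r[0] == b[0] and r[1] < b[2] and b[1] < r[2]
    ]


def join_in_batches(reads, bins, join=naive_join):
    """Run `join` one chromosome at a time, then combine the per-chromosome results."""
    read_groups = split_by_chrom(reads)
    bin_groups = split_by_chrom(bins)
    results = []
    for chrom in sorted(bin_groups):
        results.extend(join(read_groups.get(chrom, []), bin_groups[chrom]))
    return results
```

Because an interval on one chromosome can never overlap an interval on another, the combined output contains the same pairs as a single large join, while each individual job only has to hold one chromosome's worth of data.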
Hopefully this helps provide some options,
Jen
Galaxy team
On 5/4/12 2:18 PM, Xianrong Wong wrote:
Hello,
I have binned the mouse genome into fragments based on restriction
enzyme cut sites, so each bin is a fragment flanked by, say, BamHI
sites. The output file is in interval format: chr#, start, and end
coordinates of each bin. I want to count how many reads align to each
bin. I mapped my reads using bowtie and generated a dataset (interval
format) for the aligned reads. I then used join in "operate on genomic
intervals" and asked it to return intervals that inner join the "bin
file". The subsequent steps involve grouping and counting and then
joining back to the 1st dataset (BamHI delimited bins). I have tried
this workflow on small datasets and it worked. However, when I submit
my full alignment file and full BamHI delimited bin file, the tool
fails. I am doing this on the cloud. Any advice would be appreciated!
Jose
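For readers who want to see the counting step spelled out, here is a small Python sketch of counting reads per bin outside of Galaxy. It assumes the bins on each chromosome do not overlap (true for restriction fragments), that coordinates are half-open, and that a read spanning a cut site is counted once in every bin it touches. The function name count_reads_per_bin is hypothetical; this is not what the join/group tools do internally.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def count_reads_per_bin(bins, reads):
    """Return a Counter mapping each (chrom, start, end) bin to its read count."""
    starts = defaultdict(list)  # chrom -> sorted bin start coordinates
    ends = defaultdict(list)    # chrom -> matching bin end coordinates
    for chrom, start, end in sorted(bins):
        starts[chrom].append(start)
        ends[chrom].append(end)

    counts = Counter({b: 0 for b in bins})  # bins with no reads keep a 0 entry
    for chrom, r_start, r_end in reads:
        if chrom not in starts:
            continue  # read on a chromosome that has no bins
        s, e = starts[chrom], ends[chrom]
        # Last bin starting at or before the read start (or the first bin).
        i = max(bisect_right(s, r_start) - 1, 0)
        while i < len(s) and s[i] < r_end:
            if e[i] > r_start:
                counts[(chrom, s[i], e[i])] += 1
            i += 1
    return counts
```

Only the bin coordinates are kept in memory; the reads can come from any iterable, including a file read line by line.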
_________________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org. Please keep all replies on the list by
using "reply all" in your mail client. For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists,
please use the interface at:
http://lists.bx.psu.edu/
-- 
Jennifer Jackson
http://galaxyproject.org