Hi, I am developing a tool for STAR (https://code.google.com/p/rna-star/), and I realize I may be reinventing another wheel. Has anyone else created a tool for STAR? There's nothing else in the toolsheds for it yet. David -------------------- David Hoover, PhD Helix Systems Staff SCB/DCSS/CIT/NIH 301-435-2986 http://helix.nih.gov

Hi David, yes, there is initial code in the test toolshed: https://testtoolshed.g2.bx.psu.edu/. I think Ross has done some work on it. The main problem with STAR is that it needs special indices (and a lot of them), and it would be great to offer data managers for it. Cheers, Bjoern On 24.09.2014 at 22:05, David Hoover wrote:

Why didn't I see these before? Hmm, I thought I had searched both toolsheds... I was kind of hoping someone had tackled this in a different way. It would be nice if there were a composite datatype for the reference genome. It is important for users to be able to generate their own personal genome references, rather than rely on shared, admin-installed indices. And you're correct, we'd need at least 5-10 separate genome references for each organism, depending on read length and annotation GTF. Back to wheel reinvention. BTW, can you tell me which standardly installed tools use composite datatypes? It's always easier to build things by comparison than from scratch. David On Sep 24, 2014, at 4:14 PM, Björn Grüning <bjoern.gruening@gmail.com> wrote:
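To illustrate how quickly the number of references grows with read length and annotation, here is a minimal sketch. It assumes one index per (read length, annotation) pair, with `sjdbOverhang` set to read length minus 1 as the STAR manual recommends. The function name, the directory naming scheme, and the example read lengths and annotation labels are hypothetical.

```python
from itertools import product


def star_index_variants(organism, read_lengths, annotations):
    """Return one index description per (read length, annotation) pair.

    The directory naming scheme is purely illustrative; sjdbOverhang is
    read length - 1, the value recommended in the STAR manual.
    """
    variants = []
    for read_len, annotation in product(read_lengths, annotations):
        overhang = read_len - 1
        variants.append({
            "genome_dir": f"{organism}_{annotation}_sjdb{overhang}",
            "annotation": annotation,
            "sjdb_overhang": overhang,
        })
    return variants


# Hypothetical example: three read lengths and two gene models for hg19.
variants = star_index_variants("hg19", [50, 75, 100], ["refseq", "ensembl"])
for variant in variants:
    print(variant["genome_dir"], variant["sjdb_overhang"])
```

Three read lengths and two annotations already give six indexes for a single organism, which is in line with the 5-10 estimate above.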

Hi, On 24.09.2014 at 22:39, David Hoover wrote:
> Why didn't I see these before? Hmm, I thought I had searched both toolsheds...
> I was kind of hoping someone had tackled this in a different way. It would be nice if there was a composite datatype for the reference genome. It is important for users to generate their own personal genome references, rather than rely on shared, admin-installed indices. And you're correct, we'd need at least 5-10 separate genome references for each organism, depending on read length and annotation GTF.
Yes :( For me this was too cumbersome, and I decided to wait until the TopHat successor is released :)
> Back to wheel reinvention.
> BTW, can you tell me which standardly installed tools use composite datatypes? It's always easier to build things by comparison than from scratch.
Sure. Have a look at Peter's Galaxy wrappers for BLAST: https://github.com/peterjc/galaxy_blast. Its Galaxy datatypes are composite. Cheers, Bjoern

Hi David, -1 on the idea of a composite datatype for STAR reference genomes. Let's use the existing reference genome and index file infrastructure; STAR is not sufficiently different from bwa or bowtie to warrant anything different, IMHO. That approach works for us, but generating the index files manually is a pain, so it would be best to write a data manager. Ross Lazarus Head, Computational Biology, Baker IDI, Melbourne, Australia Pubs: http://scholar.google.com/citations?hl=en&user=UCUuEM4AAAAJ On Thu, Sep 25, 2014 at 6:39 AM, David Hoover <hooverdm@helix.nih.gov> wrote:
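For readers unfamiliar with the existing index infrastructure: Galaxy tools such as the bwa and bowtie wrappers look up admin-installed indexes in tab-separated `.loc` files. The sketch below parses such a file. The file name `star_index.loc`, its four-column layout, and the example paths are assumptions modelled on the bowtie/bwa convention, not taken from any published STAR wrapper.

```python
# Hypothetical contents of a tool-data/star_index.loc file. The columns
# (value, dbkey, name, path) are assumed to mirror the bowtie/bwa layout.
EXAMPLE_LOC = (
    "# value\tdbkey\tname\tpath\n"
    "hg19_sjdb59\thg19\tHuman hg19 (60bp reads)\t/data/star/hg19_sjdb59\n"
    "\n"
    "mm10_sjdb59\tmm10\tMouse mm10 (60bp reads)\t/data/star/mm10_sjdb59\n"
)


def parse_loc(text):
    """Parse .loc-style text into a list of dicts, skipping comments and blanks."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # ignore malformed lines rather than guessing
        value, dbkey, name, path = fields[:4]
        entries.append({"value": value, "dbkey": dbkey, "name": name, "path": path})
    return entries


entries = parse_loc(EXAMPLE_LOC)
for entry in entries:
    print(entry["dbkey"], "->", entry["path"])
```

A data manager would, in effect, build the index and append a line like these to the table, which is the manual step described as painful above.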

Bjorn, we'd be interested in this tool as well. Any idea how close to functional it is? I see it's only on the test toolshed, not on production, at this point. I don't see any related Trello card when searching for "star". Regards, Curtis Galaxy Admin @ University of Alabama at Birmingham On Wednesday, September 24, 2014, at 3:15 PM, Björn Grüning wrote:

Hi all, I do have a version 0.1 wrapper for STAR in our oqtans package; however, it is not yet in the toolshed. https://github.com/ratschlab/oqtans_tools/tree/master/STAR/2.3/ Feel free to comment. Vipin | Rätsch Lab On Wed, Sep 24, 2014 at 4:55 PM, Curtis Hendrickson (Campus) <curtish@uab.edu> wrote:

Hi All,
That STAR wrapper (fubar in the test toolshed) was derived from one originally written by Jeremy Goecks. I modified it for multiple inputs and added a few tweaks, and it has been in production use in our group for about 6 months, so I'm pretty sure it works reasonably well in our hands at least. I would really appreciate any available help getting it to a proven useful state; suggestions and code are welcome. I have not moved it to the main toolshed because, aside from some encouragement, I've had no feedback to suggest it's working, or not.
It is extremely fast: we regularly see 200-300M reads per minute in the logs! We regularly run a whole experiment's worth of fastq files (e.g. 12-24) simultaneously, with the shared memory option working on our cluster; see the readme.
STAR index files made with a gene model (which requires a valid gff3) are huge, 20-30GB for hg19, hence the need for shared memory if you run multiple jobs. That will eventually become a serious problem if you really want to allow users to make their own; we definitely do not. You need to be very careful about matching the gene model gff3 file to the reference, and I had enough trouble getting it right for the few major genomes we use that I do not want users attempting it and generating 25GB of rubbish every time they get it wrong.
There are challenges to do with needing different indexes for different read lengths, but we are seeing fairly consistent 60bp single-ended reads for most of the incoming RNA-seq experiments.
A data manager would be a boon if anyone cares to write one...
On Thu, Sep 25, 2014 at 6:55 AM, Curtis Hendrickson (Campus) <curtish@uab.edu> wrote:
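The memory argument for shared memory can be made concrete with a little arithmetic. This is a rough sketch that counts only the genome index itself and assumes all jobs run on one host; per-job working memory is deliberately ignored, and the function name is hypothetical.

```python
def resident_index_gb(n_jobs, index_gb, shared):
    """Approximate memory held by genome indexes on a single host.

    With shared memory the index is loaded once; without it, every
    concurrent job loads its own copy. Per-job overhead is ignored.
    """
    return index_gb if shared else n_jobs * index_gb


# Figures from the message above: 12-24 simultaneous jobs, about 25GB per index.
for n_jobs in (12, 24):
    separate = resident_index_gb(n_jobs, 25, shared=False)
    shared = resident_index_gb(n_jobs, 25, shared=True)
    print(f"{n_jobs} jobs: {separate}GB without sharing, {shared}GB with sharing")
```

Under these assumptions, 12 to 24 concurrent jobs would need roughly 300 to 600GB for index copies alone without sharing, against about 25GB with it.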

Ross,
About the index files: It is way easier to have pre-built index files. However, for a 2-pass STAR run, a user will need to generate their own reference index files based on the SJ.out.tab file created in the first pass. Is this incorporated into your tool?
About shared memory: I am under the impression that the latest version of STAR has deprecated this feature. I am unclear how this would help unless a single large-memory machine was dedicated to running all STAR jobs. Is this the case?
Also, does the tool merge the SAM/BAM file with the output chimeric SAM file?
David Hoover On 9/24/2014 7:03 PM, Ross wrote:
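The manual 2-pass procedure mentioned above can be sketched as three STAR invocations: align once, regenerate the index with the junctions STAR wrote to `SJ.out.tab`, then align again. The sketch only builds the command lines and does not run them. The function name, directory names, and file names are hypothetical; the STAR options used (`--runMode genomeGenerate`, `--sjdbFileChrStartEnd`, `--sjdbOverhang`, and so on) are taken from the STAR manual, but check them against the version you have installed.

```python
def two_pass_commands(genome_fasta, base_index, reads, read_length, threads=8):
    """Build (but do not run) the three commands of a manual 2-pass STAR run.

    Output directories are assumed to exist already; STAR does not create
    the --genomeDir or the directory part of --outFileNamePrefix for you.
    """
    pass1_prefix = "pass1/"        # hypothetical output locations
    pass2_prefix = "pass2/"
    pass2_index = "index_pass2"
    sjdb_overhang = read_length - 1

    pass1 = ["STAR", "--runThreadN", str(threads),
             "--genomeDir", base_index,
             "--readFilesIn", *reads,
             "--outFileNamePrefix", pass1_prefix]
    # Rebuild the index, inserting the junctions discovered in pass 1.
    regenerate = ["STAR", "--runMode", "genomeGenerate",
                  "--runThreadN", str(threads),
                  "--genomeDir", pass2_index,
                  "--genomeFastaFiles", genome_fasta,
                  "--sjdbFileChrStartEnd", pass1_prefix + "SJ.out.tab",
                  "--sjdbOverhang", str(sjdb_overhang)]
    pass2 = ["STAR", "--runThreadN", str(threads),
             "--genomeDir", pass2_index,
             "--readFilesIn", *reads,
             "--outFileNamePrefix", pass2_prefix]
    return [pass1, regenerate, pass2]


commands = two_pass_commands("hg19.fa", "index_pass1", ["sample_R1.fastq"], 60)
for command in commands:
    print(" ".join(command))
```

Because the second index is specific to one sample's junctions, it is per-user and per-run data, which is why pre-built admin indexes alone do not cover this workflow.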

Hi David. I've not needed that workflow, so I haven't a solution for you, and no, it doesn't do anything with chimeric output; that won't be hard to add, I suspect. There's no Python wrapper, just a shell script in the command segment. It's not in an IUC main tool shed repository because it lacks a data manager. Manual STAR indexes are a bit of a pain, but less pain than writing a data manager :( so I haven't yet. It might be best run through the API. On shared memory: pity, it works a treat for us. I didn't see anything on the Google group. Do you recall where you learned about this deprecation? On Thu, Sep 25, 2014 at 10:41 PM, David Hoover <hooverdm@helix.nih.gov> wrote:

A colleague of mine mentioned it. I'll ask him where he got his info. Just to clarify: do you always run STAR jobs on the same host? We are running Galaxy in front of a batch system cluster, so by default STAR jobs would run on different nodes. It's not clear to me how long the allocated memory would last after the batch job finished. How do you determine whether the memory remains allocated and whether the job has been accelerated by pre-loaded data? For example, if you load a genome reference using --genomeLoad LoadAndKeep, then run an alignment, are subsequent alignments using the same genome reference much faster? If so, how much faster? I apologize, I am a jack of all trades, master of none. I could test this myself, but everything I touch related to genomics takes >50GB of memory and >18 hours of wall-clock time, and it gets painful to try testing everything. David On Sep 26, 2014, at 4:13 AM, Ross <ross.lazarus@gmail.com> wrote:
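One way to test these questions on a single node is to preload the genome, run the same alignment repeatedly while timing it, and then release the memory. The sketch below only builds the command lines. It assumes the `--genomeLoad` values `LoadAndExit`, `LoadAndKeep`, and `Remove` as documented in the STAR manual; the function name and paths are hypothetical. Since the shared segment lives on one host, jobs dispatched to other nodes would presumably each load their own copy.

```python
def shared_memory_commands(genome_dir, reads, prefix):
    """Build (but do not run) STAR commands for a shared-memory experiment."""
    # Load the genome into shared memory and exit without aligning.
    preload = ["STAR", "--genomeDir", genome_dir, "--genomeLoad", "LoadAndExit"]
    # Align against the already-loaded genome and leave it in memory afterwards.
    align = ["STAR", "--genomeDir", genome_dir, "--genomeLoad", "LoadAndKeep",
             "--readFilesIn", *reads, "--outFileNamePrefix", prefix]
    # Explicitly release the shared segment once all jobs are done.
    release = ["STAR", "--genomeDir", genome_dir, "--genomeLoad", "Remove"]
    return {"preload": preload, "align": align, "release": release}


steps = shared_memory_commands("/data/star/hg19_sjdb59", ["sample.fastq"], "run1/")
for name, command in steps.items():
    print(f"{name}: {' '.join(command)}")
```

On Linux, `ipcs -m` lists the shared memory segments currently allocated on the host, which is one way to check whether the genome is still resident after a batch job ends. Timing the `align` step with and without a prior `preload` would give a direct measure of any speed-up.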

OK... hmm... My colleague can't recall where he heard it. Maybe he made it up, or it was a misunderstanding about a recent patch that Alex (the author of STAR) issued. Never mind... In any case, would we need to specify a single large-memory node to use this feature of STAR? How long does the memory allocation last? David On Sep 26, 2014, at 9:21 AM, David Hoover <hooverdm@helix.nih.gov> wrote: