Hi Richard,
Please see some comments below.
On Fri, Dec 10, 2010 at 8:59 PM, <anton@bx.psu.edu> wrote:
---------------------------- Original Message ----------------------------
Subject: Re: [galaxy-dev] Wiki erratum...
From: "Richard Bruskiewich" <r.bruskiewich@irri.org>
Date: Fri, December 10, 2010 7:17 pm
To: "Anton Nekrutenko" <anton@bx.psu.edu>
--------------------------------------------------------------------------
Hi Anton,
The power of open source: many eyes... Glad to be of help.
BTW, thank you for all your tutorial videos... they are excellent. I present them to my staff as an example of how to empower end users so they can work more independently.
I am located at the International Rice Research Institute (IRRI; www.irri.org) in the Philippines where I've been for over a decade working on rice genomics. Due to recent strategic research restructuring here, I now have the excuse, after years of senior research management, of simply being a bioinformatics hacker again. It's both fun and frustrating.
I'm just getting seriously started with Galaxy although I've known about the platform for some years now. It is a very exciting tool. I can't wait to put it to good use in our projects here.
In particular, NGS data sets are starting to pour in from many IRRI projects. Galaxy promises to make the analysis of such data tractable, documented and efficient.
In fact, in 2011, we may be resequencing up to 10,000 new rice genomes. Galaxy on the Amazon cloud is a godsend for this, although I'm patiently waiting for the AMI to be cloned to the ap-southeast region in Singapore, where we do most of our computing deployments (since we are in Asia). I've also been told that the next release will also include the Michael Smith Genome Sciences Center ABySS assembler... I'm keen on using that software within Galaxy.
On that note, a technical question about which I'm curious: does the Galaxy configuration currently allow specific tools to run on instances of a specific size? For example, if I fire up a Galaxy CloudMan cluster with a few large-RAM Amazon instances/nodes, can I request that specific software components (e.g. assemblers like ABySS) run only, or preferably, on those high-capacity nodes?
Currently, Galaxy Cloud allows a cluster to be composed of multiple instance types, but the selection of which tool runs on which instance is handled by the job manager (i.e., SGE), so a specific job cannot be targeted at a specific instance type; we should eventually provide support for this type of functionality. In the meantime, a cluster can be composed of the instance type that matches the current workload, and the instance type can then be changed as the workload changes.
Also, Amazon has so-called "cluster" instances, and now GPU cluster instances. Again, the same idea applies: can specific tools be told to only run on such a cluster instance? Further ahead, could Galaxy be configured to automatically start/stop specific instances only when needed (including cluster instances)?
MPI-type jobs are the only true beneficiaries of the cluster instances, yet only a handful of bioinformatics tools are actually implemented using MPI, and those instances require a different AMI, so we do not currently support that type of instance; maybe down the line. Nonetheless, in the coming new version of Galaxy Cloud (currently being tested), the application will be able to automatically scale the size of the cluster based on the current workload.
Thanks for your interest,
Enis
I know... probably forging recklessly ahead here. I hope to have a stronger computing science staff on board in a few months which may allow me to explore such topics more proactively, but I'm simply wondering about the state-of-the-art here.
I hope that once I get more familiar with the platform, I'll be able to contribute back more. I'm configuring Galaxy to connect to rice genome data, and there are some other tools I think might be useful in the platform (for our work, anyhow) so I'll get them in, then share the configuration files with the community. Maybe the deeper I dig, the more useful I'll get :-).
Cheers Richard
-- *Richard Bruskiewich, PhD* Senior Scientist, Computational and Systems Biology Applications Team for Computational Genomics T.T. Chang Genetic Resources Center International Rice Research Institute
On Fri, Dec 10, 2010 at 9:50 PM, Anton Nekrutenko <anton@bx.psu.edu> wrote:
Richard:
This beauty was mine. Thanks for pointing this out. It is now fixed.
Thanks,
anton
On Dec 9, 2010, at 10:04 PM, Richard Bruskiewich wrote:
Galaxy Colleagues,
I don't know who is maintaining the Galaxy wiki page at http://bitbucket.org/galaxy/galaxy-central/wiki/NGSLocalSetup but I noticed that the Python script under the Megablast instructions has an error: the "defline" assignment after the "line.startswith" test should be moved *after* the "if length > 0" block; otherwise the defline is reset before the previous sequence is written out. This results in a frameshift in the FASTA header line identifiers (i.e. each sequence gets the identifier of the next sequence).
I've commented out the erroneous defline below and added the right one:
import sys

length = 0
defline = ''
seq = []

for line in sys.stdin :
    line = line.rstrip( '\r\n' )
    if line.startswith( '>' ):
        # defline = line.split( "|" )[1]  # defline should NOT be here
        if length > 0:
            print ">%s_%s" % ( defline, length )
            print "\n".join( seq )
        length = 0
        seq = []
        defline = line.split( "|" )[1]  # defline should be here
    else:
        seq.append( line )
        length += len( line )

print ">%s_%s" % ( defline, length )
print "\n".join( seq )
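For anyone who wants to check the ordering fix in isolation, here is a minimal Python 3 sketch of the same logic written as a generator. It is not the wiki script itself: the function name append_lengths is made up for illustration, and unlike the script above it uses "have we seen a header yet" rather than "length > 0" to decide when to flush a record, and it ignores any lines that appear before the first header. It assumes NCBI-style deflines where the identifier is the second "|"-separated field.

```python
def append_lengths(lines):
    """Yield FASTA lines with each header rewritten as >ID_LENGTH.

    Hypothetical helper for illustration only. Assumes deflines such as
    '>gi|111|ref|...', where the identifier is the second '|' field.
    """
    defline = None  # identifier of the record currently being collected
    seq = []        # sequence lines of that record
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Flush the previous record first, while `defline` still names it.
            if defline is not None:
                yield ">%s_%d" % (defline, sum(len(s) for s in seq))
                yield from seq
            # Only now switch to the identifier of the record that follows.
            defline = line.split("|")[1]
            seq = []
        elif defline is not None:
            seq.append(line)
    # Flush the final record, if any header was seen at all.
    if defline is not None:
        yield ">%s_%d" % (defline, sum(len(s) for s in seq))
        yield from seq


example = [">gi|111|ref|A\n", "ACGT\n", "AC\n", ">gi|222|ref|B\n", "GG\n"]
result = list(append_lengths(example))
print("\n".join(result))
```

With the two-record example, the first record keeps its own identifier (111) and its own length (6); with the defline assignment placed before the flush, it would have been written out under 222 instead. To use the sketch as a filter, one could loop over append_lengths(sys.stdin) and print each line.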
While on the topic of this page, perhaps the software versions need to be revisited. Megablast has already been superseded by BLAST+. Perhaps new releases of Galaxy should update this?
BTW, when is the new Galaxy release (CloudMan AMI too...) coming out? I heard rumors that it was due this week.
Cheers Richard Bruskiewich
-- *Richard Bruskiewich, PhD* Senior Scientist, Computational and Systems Biology Applications Team for Computational Genomics T.T. Chang Genetic Resources Center International Rice Research Institute
_______________________________________________ galaxy-dev mailing list galaxy-dev@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-dev
Anton Nekrutenko http://nekrut.bx.psu.edu http://usegalaxy.org
_______________________________________________ galaxy-lab mailing list galaxy-lab@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-lab