I'm getting up to speed on Galaxy and couldn't find examples or discussion related to the architecture, so I was hoping an expert could give some quick pointers/guidance.

Where do I find info on whether the installed applications make use of multiple nodes via MPI (etc.), which would indicate the benefit of starting up X nodes for faster processing?

If a workflow has multiple initial inputs, say for processing NGS exome data from tumor and blood (compared later in the workflow), will each step without a dependency get sent to a different node, or will the entire workflow run on one node?

If I have NGS data for 20 patients sitting in an S3 bucket and want a specific workflow run against each patient's input(s), does this require manual selection of files by a user, or can the workflow be automated?

Can I programmatically start a workflow remotely (via REST) where I have automated the process of uploading NGS data to S3 and know the input file(s) per workflow?

Is it possible to present credentials in a workflow for downloading a file from S3 where I require authentication before a file can be downloaded? I'm working with NGS data for patients, so I'm trying to understand how I can keep security tight. I'm currently planning on restricting downloads to the IP address of the cluster, but that gets a little complicated given what Amazon is doing behind the scenes in its internal network.

I would also like to push results/output back to S3 and didn't see anything obvious to do this. It gets a little complicated in that you would probably need to put results back in the same S3 bucket, in a new folder next to where the original source files came from. I saw mention of using scp to move files, but that doesn't help to put results back in S3.

So far I really like what I have seen and hope Galaxy becomes the future toolbox for our work. Does a roadmap exist for what is planned in the future? For example, are any additional NGS tools like Abyss going to make it into the build? I'm interested in NGS software that handles the dynamics of cancer for gene fusion events, CNVs (etc.) when dealing with NGS data.

Thanks
Scooter
Where do I find info on whether the installed applications make use of multiple nodes via MPI (etc.), which would indicate the benefit of starting up X nodes for faster processing?
You'll need to look at the individual tool documentation. In general, many tools use multiple cores, but few use MPI for multi-node computing.
If a workflow has multiple initial inputs, say for processing NGS exome data from tumor and blood (compared later in the workflow), will each step without a dependency get sent to a different node, or will the entire workflow run on one node?
If you've set up Galaxy to use a job scheduler (e.g. SGE/PBS), multiple nodes can be used. Multiple nodes will be used on the cloud: http://wiki.g2.bx.psu.edu/CloudMan
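To make the "steps without a dependency" idea concrete, here is a minimal sketch that groups the steps of a workflow into waves that could run at the same time. It is not Galaxy's scheduler, and the step names are hypothetical; it only illustrates which steps of a tumor/blood workflow are independent enough for a job scheduler to place on different nodes.

```python
def parallel_levels(dependencies):
    """Group steps into waves; steps in the same wave do not depend on each other.

    `dependencies` maps each step name to the list of steps it waits for.
    """
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    levels = []
    while remaining:
        # A step is ready once all of its upstream steps have been scheduled.
        ready = sorted(step for step, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle or unknown dependency in workflow")
        levels.append(ready)
        for step in ready:
            del remaining[step]
        for deps in remaining.values():
            deps.difference_update(ready)
    return levels


# Hypothetical tumor/blood workflow: two independent branches, then a comparison.
workflow = {
    "align_tumor": [],
    "align_blood": [],
    "call_variants_tumor": ["align_tumor"],
    "call_variants_blood": ["align_blood"],
    "compare": ["call_variants_tumor", "call_variants_blood"],
}

levels = parallel_levels(workflow)
for number, steps in enumerate(levels, start=1):
    print("wave %d: %s" % (number, ", ".join(steps)))
```

In this sketch the tumor and blood branches land in the same wave, so they could run on separate nodes, while the comparison step has to wait for both.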
If I have NGS data for 20 patients sitting in an S3 bucket and want a specific workflow run against each patient's input(s), does this require manual selection of files by a user, or can the workflow be automated?
Automation via the API is possible. Unfortunately, most of the API documentation is in the Py/Sphinx docs for now, so you'll have to dig and/or use the sample scripts in <galaxy_dir>/scripts/api.
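As a rough illustration of the bookkeeping an automated run needs, the sketch below groups a list of S3 object keys into one set of inputs per patient. The key layout (`<patient>/<sample>.fastq`) and the names are assumptions made up for the example, not anything Galaxy or S3 requires; listing the bucket and calling the API are left out.

```python
from collections import defaultdict


def group_by_patient(keys):
    """Map each patient to {sample_type: key}, assuming keys look like
    '<patient>/<sample_type>.<extension>' (a hypothetical layout)."""
    runs = defaultdict(dict)
    for key in keys:
        patient, filename = key.split("/", 1)
        sample_type = filename.split(".", 1)[0]
        runs[patient][sample_type] = key
    return dict(runs)


# Hypothetical bucket listing for two of the patients.
keys = [
    "patient01/tumor.fastq",
    "patient01/blood.fastq",
    "patient02/tumor.fastq",
    "patient02/blood.fastq",
]

runs = group_by_patient(keys)
for patient in sorted(runs):
    print(patient, runs[patient])
```

Each entry in `runs` would then correspond to one workflow invocation, with the tumor and blood files mapped to the workflow's two input steps.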
Can I programmatically start a workflow remotely (via REST) where I have automated the process of uploading NGS data to S3 and know the input file(s) per workflow?
Yes.
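A minimal sketch of what such a request might look like, using only the Python standard library. The endpoint and payload fields (`workflow_id`, `history`, `ds_map`) are modeled on my recollection of the workflow sample script in <galaxy_dir>/scripts/api and should be treated as assumptions; check them against the scripts shipped with your Galaxy version. The URL, key, and IDs are placeholders, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_workflow_request(galaxy_url, api_key, workflow_id, history_name, step_inputs):
    """Build (but do not send) a POST request that asks Galaxy to run a workflow.

    `step_inputs` maps a workflow input step id to the id of a dataset
    already present in a Galaxy history. Field names are assumptions.
    """
    payload = {
        "workflow_id": workflow_id,
        "history": history_name,  # name for the history that receives the results
        "ds_map": {
            str(step): {"src": "hda", "id": dataset_id}
            for step, dataset_id in step_inputs.items()
        },
    }
    url = "%s/api/workflows?key=%s" % (galaxy_url.rstrip("/"), api_key)
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; none of these identify a real server or dataset.
request = build_workflow_request(
    "http://localhost:8080/",
    "MY_API_KEY",
    "workflow123",
    "patient01 exome run",
    {1: "tumor_dataset_id", 2: "blood_dataset_id"},
)
print(request.get_method(), request.full_url)
```

Sending it would be a call to `urllib.request.urlopen(request)` against a running Galaxy instance. The API key grants the same access as your account, so it should be handled like a password.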
Is it possible to present credentials in a workflow for downloading a file from S3 where I require authentication before a file can be downloaded?
You can restrict dataset access using role-based security.
Does a roadmap exist for what is planned in the future?
A very high-level roadmap is in this presentation: http://wiki.g2.bx.psu.edu/Documents/Presentations/GCC2012?action=AttachFile&do=get&target=State.pdf
For example, are any additional NGS tools like Abyss going to make it into the build?
The framework is being separated from tools. The best place to look for tools is in the toolshed, where there is an abyss wrapper: http://toolshed.g2.bx.psu.edu/
Interested in NGS software that handles the dynamics of cancer for gene fusion events, CNVs (etc.) when dealing with NGS data.
There is active work on cancer tools for Galaxy. Keeping an eye on the toolshed is a good idea here.

Best,
J.
Participants (2): Jeremy Goecks, Scooter Willis