merging fastq files
Hi, I have received Illumina paired-end genome sequence data as a .tar file. When unpacked, the data for each genome accession is split into about 100 fastq files, totalling about 37 Gbp per genome. Can you recommend the best way to organise this data prior to mapping to the reference genome? I can concatenate the unpacked files into forward and reverse reads at the DOS command line before uploading: is this the best approach? Or is there a tool that will start from the .tar file? Andrew
Hi Andrew, Merging the data prior to upload would probably be simplest; files in a Galaxy history cannot stay in .tar format at this time. Loading forward and reverse reads as separate files will most likely be important for the analysis itself. Once ready for upload, you can tar or gzip - as long as each upload contains a single file - or leave the files uncompressed; either is fine. FTP is required for larger data (>= 2 GB), and using a client that lets you track progress and resume an interrupted transfer can be helpful. Each file can be up to 50 GB in size if you have an account. http://wiki.galaxyproject.org/FTPUpload Hopefully this helps, Jen, Galaxy team
-- Jennifer Hillman-Jackson Galaxy Support and Training http://galaxyproject.org
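For reference, a minimal sketch of the merge step discussed above, in Python rather than the DOS command line; the accession directory and the "_R1_"/"_R2_" naming pattern are assumptions, so adjust the glob patterns to match the actual lane files:

# Minimal sketch: concatenate per-lane FASTQ files into one forward (R1) and
# one reverse (R2) file per accession, preserving the order of the lane files.
# The directory name and R1/R2 naming pattern below are assumptions.
import glob
import shutil

def merge_fastq(pattern, out_path):
    """Concatenate all files matching `pattern` (sorted by name) into `out_path`."""
    with open(out_path, "wb") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)

# Example usage for one accession (hypothetical paths):
merge_fastq("accession1/*_R1_*.fastq", "accession1_forward.fastq")
merge_fastq("accession1/*_R2_*.fastq", "accession1_reverse.fastq")

Because both calls sort the lane files the same way, the forward and reverse outputs stay in matching order, assuming each lane's R1 and R2 files list reads in the same order (as Illumina lane files normally do).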
Dear Jen, Thanks. I have merged the files and end up with 4 x 47 GB fastq files for read mapping to the reference. It seems this is too much data to analyse on the public main instance if the size limit is 250 GB? So I tried to set up the cloud option following the screencast (http://screencast.g2.bx.psu.edu/cloud/), but when I search for the current AMI name (861460482541/galaxy-cloudman-2011-03-22) it is not found in the list of community AMIs in Amazon's EC2 Management Console. Any ideas why this is not working? Regards, Andrew
Hi Andrew, My first guess is that perhaps the region is not set correctly? See http://wiki.galaxyproject.org/CloudMan/AWS/GettingStarted, "Step 1: One Time Amazon Setup", subsection 2, which says to set your AWS Region to US East (Virginia). The image in the wiki for step 1.2 is slightly outdated; I have attached a screenshot of how it looks now. Please give this a try and let us know if you continue to have issues. Thanks! Jen, Galaxy team
-- Jennifer Hillman-Jackson Galaxy Support and Training http://galaxyproject.org
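For anyone scripting the same check rather than using the console, a rough sketch with the boto3 library is below. The owner ID and AMI name pattern are taken from the name quoted in the thread; the point, as above, is that the query must be made against US East (Virginia), i.e. us-east-1, or the image will not be found.

# Rough sketch: list Galaxy CloudMan AMIs visible in the US East (Virginia)
# region, which is the region the CloudMan wiki asks you to select.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.describe_images(
    Owners=["861460482541"],  # owner ID from the AMI name in the thread
    Filters=[{"Name": "name", "Values": ["galaxy-cloudman-*"]}],
)
for image in sorted(response["Images"], key=lambda i: i["Name"]):
    print(image["ImageId"], image["Name"])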
Dear Jen, Yes, that was my problem; I skipped some steps by relying too much on the screencast and ignoring the text. Now I am reluctant to launch the AMI because I am having trouble estimating my usage and costs on AWS - as a new user I have little idea what to set many of the parameters in the usage calculator to. Are there any examples of typical parameters and costs for running Galaxy on the cloud? My first task is to map about 80 Gbp of total paired-end reads from genomic DNA from two accessions to a 900 Mbp reference genome and then find SNPs and indels. A ball-park figure would be reassuring! Regards, Andrew
Hi Andrew, We do not have any estimates posted on the wiki currently for example usage on the cloud, but this is a good idea and the team is discussing the best way to add something like it. The difficulty is how variable actual job run-times can be, but there are still some ways to break this down. The examples below are based on how *long* an instance would be up and centre on two primary costs: the type of instance and the size of the EBS volume. The details are at Amazon: http://aws.amazon.com/ec2/pricing

- 1 extra-large high-memory instance capable of RNA-seq, with 200 GB storage: ~$25/day, plus $10/day for each worker instance.
- 1 basic instance capable of general text manipulation, with 50 GB storage: ~$10/day.

I am not sure whether you will be using GATK or SAMtools for your processing, but running any variant-analysis pipeline would be somewhat similar to an RNA-seq pipeline, since it involves mapping, large data-file manipulations, etc. For your particular case the data storage would be larger than the estimate above, so the pricing table at Amazon should help you calculate a figure that reflects your storage needs. It is difficult to say how long any job will run purely from the size of the inputs, as content and parameter settings have a significant effect on run time, but after the first job, or the first time through a complete workflow, if the data is reasonably homogeneous you may be able to estimate a total for future runs. Although I, or almost anyone else, can tell you that these sorts of experiments can pop out with surprises now and then!

Others on the list using a cloud instance are welcome to post comments to this thread. Once we get the initial wiki table posted, it will be open to community input, so that this type of actual usage data can be captured. If you or anyone else wants to send back results in the meantime (post to the thread and/or the ticket, with experiment and instance detail), please do; here is the new development card: https://trello.com/c/pMbri7QI

Hopefully this helps a little! Apologies for not being able to give more detail - this is a tough question to answer with precision for a complete workflow. A pool of case examples is probably the best way to get a bead on it, so that is part of the goal now. Jen, Galaxy team
-- Jennifer Hillman-Jackson Galaxy Support and Training http://galaxyproject.org
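As a back-of-the-envelope helper for the per-day figures Jen quotes above, a small Python sketch is below. The hourly instance rates and the EBS price per GB-month are placeholders, not real quotes; substitute the current numbers from aws.amazon.com/ec2/pricing before relying on the result.

# Back-of-the-envelope cost sketch: instance-hours plus pro-rated EBS storage.
# All prices below are placeholders to be replaced with current AWS pricing.
def daily_cost(instance_hourly_usd, ebs_gb, ebs_usd_per_gb_month=0.10,
               workers=0, worker_hourly_usd=0.0):
    """Estimated USD cost of keeping the cluster up for one 24-hour day."""
    instances = 24 * (instance_hourly_usd + workers * worker_hourly_usd)
    storage = ebs_gb * ebs_usd_per_gb_month / 30  # per-day share of monthly EBS cost
    return instances + storage

# Example: a high-memory master at a hypothetical $1.00/hr with 200 GB EBS and
# one worker at a hypothetical $0.40/hr, roughly in line with the $25-35/day
# figures quoted in the thread.
print(round(daily_cost(1.00, 200, workers=1, worker_hourly_usd=0.40), 2))

The same function makes it easy to see how the daily figure scales once the larger storage needed for ~188 GB of merged fastq input, plus mapping outputs, is plugged in.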
participants (2)
- Jennifer Jackson
- Thompson, Andrew