April 2011 - galaxy-user - lists.galaxyproject.org

suggestion for multithreading
by Louise-Amélie Schmitt 08 Aug '11

Hello everyone,

I'm using TORQUE with Galaxy, and we noticed that if a tool is multithreaded, the number of cores it needs is not communicated to PBS, which leads to job crashes if the required resources are not available when the job is submitted.

I therefore modified the code slightly, as follows, in lib/galaxy/jobs/runners/pbs.py (around line 256):

    # define PBS job options
    attrs.append( dict( name = pbs.ATTR_N, value = str( "%s_%s_%s" % ( job_wrapper.job_id, job_wrapper.tool.id, job_wrapper.user ) ) ) )
    mt_file = open('tool-data/multithreading.csv', 'r')
    for l in mt_file:
        l = string.split(l)
        if ( l[0] == job_wrapper.tool.id ):
            attrs.append( dict( name = pbs.ATTR_l, resource = 'nodes', value = '1:ppn='+str(l[1]) ) )
            attrs.append( dict( name = pbs.ATTR_l, resource = 'mem', value = str(l[2]) ) )
            break
    mt_file.close()
    job_attrs = pbs.new_attropl( len( attrs ) + len( pbs_options ) )

(sorry it didn't come out very well due to line breaking)

The csv file contains a list of the multithreaded tools, each line containing:

    <tool id>\t<number of threads>\t<memory needed>\n

It works fine, and the jobs wait for their turn properly, but the information is duplicated. Perhaps there would be a way to include something similar in Galaxy's original code (if it is not already the case, I may not be up-to-date) without duplicating data.

I hope that helps :)

Best regards,
L-A
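The lookup described in the post can be tried on its own, without a PBS installation. The following is a minimal sketch, assuming the tab-separated layout given above (tool id, number of threads, memory needed). The function name `lookup_resources` and the example tool ids and memory values are hypothetical and are not part of Galaxy.

```python
import csv
import io


def lookup_resources(tool_id, rows):
    """Return PBS-style resource requests for tool_id, or {} if not listed.

    rows is any iterable of text lines with three tab-separated fields:
    tool id, number of threads, memory needed.
    """
    for fields in csv.reader(rows, delimiter="\t"):
        # Skip blank or malformed lines instead of raising IndexError.
        if len(fields) < 3:
            continue
        if fields[0] == tool_id:
            return {"nodes": "1:ppn=%d" % int(fields[1]), "mem": fields[2]}
    return {}


# Hypothetical file contents, for illustration only.
example = io.StringIO("bowtie_wrapper\t4\t8gb\ntophat\t8\t16gb\n")
print(lookup_resources("tophat", example))
```

Returning an empty dict for unlisted tools lets the caller fall back to the scheduler defaults, which matches the behaviour of the loop in the post when no line matches.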

RNA-seq Galaxy workflow for PE barcoded samples?
by Whyte, Jeffrey 13 Jul '11

Hello,

I posted to the seqanswers forum, but have not received any feedback. I am working with RNA-seq Illumina data files in Galaxy (http://main.g2.bx.psu.edu/). The two files are 100bp paired-end reads, multiplexed with barcoding to distinguish samples. The barcodes are the first four bases of the sequences in the s_7_1_sequence.txt file.

Would the following Galaxy workflow be correct?

1. Upload both s_7_1_sequence.txt and s_7_2_sequence.txt to Galaxy with the reference genome selected
2. Run NGS: QC and manipulation --> FASTQ Groomer on each file to convert to Sanger FASTQ
3. Run NGS: QC and manipulation --> FASTQ joiner to combine the data from the two files
4. Run FASTX-TOOLKIT FOR FASTQ DATA --> Barcode Splitter to generate separate FASTQ files for each barcode group
5. Run NGS: RNA Analysis --> Tophat to map the reads from each group to the reference genome

The problem I am having is that if I select paired-end for the library in Tophat, it requests two FASTQ files. Would I have to use FASTQ Splitter to separate the joined FASTQ files?

If there is a more standard way to handle these types of barcoded files, I would appreciate hearing about this workflow.

Thanks very much in advance,
jjw

P.S. Galaxy is an incredibly useful resource. Thanks!
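To illustrate the idea behind the question above (splitting by barcode while keeping both mates of a pair together), here is a minimal sketch outside Galaxy. It assumes four-line FASTQ records, both files listing the pairs in the same order, a four-base barcode at the start of read 1 only, and exact barcode matches. The function names are hypothetical, and this does not reproduce the behaviour of the Galaxy FASTQ joiner, Barcode Splitter or FASTQ Splitter tools.

```python
from collections import defaultdict
from itertools import islice


def fastq_records(lines):
    """Yield (header, sequence, plus, quality) tuples from four-line FASTQ."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def split_pairs_by_barcode(read1_lines, read2_lines, barcodes, length=4):
    """Group read pairs by the barcode at the start of read 1.

    Returns {barcode: [(trimmed_read1, read2), ...]}. Pairs whose barcode
    is not in `barcodes` are collected under the key "unmatched".
    """
    groups = defaultdict(list)
    pairs = zip(fastq_records(read1_lines), fastq_records(read2_lines))
    for (h1, s1, p1, q1), r2 in pairs:
        barcode = s1[:length]
        if barcode in barcodes:
            # Trim the barcode and its quality values from read 1 only.
            groups[barcode].append(((h1, s1[length:], p1, q1[length:]), r2))
        else:
            groups["unmatched"].append(((h1, s1, p1, q1), r2))
    return dict(groups)
```

Because each group holds (read 1, read 2) tuples, writing the first elements to one file and the second elements to another gives the two synchronized FASTQ files that a paired-end mapper expects.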

Question regarding quality filtering of 454 amplicons
by Jackie Lighten 01 Jul '11

Hi,

I have a question for you guys regarding quality filtering. I have a data set of double MID tagged 454 amplicons, from which I wish to select high quality sequences above Q20. The 454 quality filtering system seems to work differently from the one provided for Illumina sequencing, i.e. 454 filtering takes high quality segments, while Illumina (FASTQ) filtering can select high quality full reads based on certain parameters.

OK, so I know that the total length of my amplicon, including primers and barcodes, is around 260bp. If I set the 454 quality filtering tool to extract contiguous high quality sequence of >260, it gives me back around 45% of my raw data as meeting this criterion, i.e. all 260bp are above Q20. I don't necessarily need this high stringency, as most bases may not be informative.

But if I convert my 454 data to FASTQ format and then run the Illumina filtering system, which also allows me to set the number of bases allowed to deviate from the Q20 criterion, I get back over 90% of my data (allowing 10bp to deviate from Q20). I then need to go ahead and convert back to 454 format.

Can you tell me if this is OK? Will I lose or confuse information somewhere along these conversions? It seems that if I do this, my barcodes are removed, as amplicons do not sort properly when I parse them through my barcode filtering program.

Does anyone know of a program to filter 454 data based on average sequence quality score, which doesn't involve Linux and the Roche off-instrument program? (I have no experience in Linux!)

Thanks!

--
Jack Lighten,
Ph.D. Candidate,
Bentzen Lab, Room 6078,
Department of Biology,
Dalhousie University,
Halifax, NS, B3H 4J1
Canada
Office: (902) 494-1398
Email: Jackie.Lighten(a)Dal.Ca
Profile: www.marinebiodiversity.ca/CHONe/Members/lightenj/profile/bio
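The difference between the filtering criteria discussed above can be made concrete with a small sketch. It assumes each read's qualities are already available as a list of integer Phred scores. The function names are hypothetical, and the code does not reproduce the exact behaviour of the Galaxy 454 or FASTQ filtering tools.

```python
def longest_run_at_or_above(quals, threshold=20):
    """Length of the longest contiguous stretch with quality >= threshold."""
    best = current = 0
    for q in quals:
        current = current + 1 if q >= threshold else 0
        best = max(best, current)
    return best


def passes_contiguous(quals, min_length, threshold=20):
    """Segment-style filter: needs one unbroken high-quality stretch."""
    return longest_run_at_or_above(quals, threshold) >= min_length


def passes_with_deviations(quals, max_below, threshold=20):
    """Whole-read filter: tolerates up to max_below bases under threshold."""
    return sum(1 for q in quals if q < threshold) <= max_below


def passes_mean(quals, threshold=20):
    """Average-quality filter over the whole read."""
    return bool(quals) and sum(quals) / len(quals) >= threshold


# A 260-base read with a single Q15 base in the middle.
read = [30] * 100 + [15] + [30] * 159
print(passes_contiguous(read, 260))      # False
print(passes_with_deviations(read, 10))  # True
print(passes_mean(read))                 # True
```

A single low-quality base is enough to fail the contiguous criterion at full amplicon length, which is consistent with the large gap between the two retention rates reported in the post.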

MACS
by Sher, Falak 14 Jun '11

While using the MACS tool from Galaxy with a BED file, I get the following message: "Treatment tags and Control tags are uneven! FDR may be wrong". Any suggestions on how to fix it? I don't understand its implication. My data is from a single Illumina ChIP-Seq experiment, and I used Bowtie from Galaxy for mapping.

F

Flagstat on BAM files
by Slim Sassi 03 Jun '11

Hello,

I tried to use NGS: SAM Tools -> flagstat on a BAM file for basic stats, but I got the results you see below. It doesn't seem to be working. Any suggestions?

    26584869 in total
    0 QC failure
    0 duplicates
    26584869 mapped (100.00%)
    0 paired in sequencing
    0 read1
    0 read2
    0 properly paired (-nan%)
    0 with itself and mate mapped
    0 singletons (-nan%)
    0 with mate mapped to a different chr
    0 with mate mapped to a different chr (mapQ>=5)

Thanks,
Slim
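One possible reading of the output above is that none of the reads carry the "paired" flag, so the pair-related percentages are 0 divided by 0. The sketch below illustrates that arithmetic using the flag bits defined in the SAM specification (0x1 paired, 0x2 properly paired, 0x4 unmapped). It is an assumption that the percentages are ratios against the paired count, and the function names are hypothetical. This is not the samtools implementation.

```python
import math

# Flag bits from the SAM specification.
PAIRED, PROPER_PAIR, UNMAPPED = 0x1, 0x2, 0x4


def summarize(flags):
    """Count total, mapped, paired and properly paired reads from SAM flags."""
    counts = {"total": 0, "mapped": 0, "paired": 0, "proper": 0}
    for flag in flags:
        counts["total"] += 1
        if not flag & UNMAPPED:
            counts["mapped"] += 1
        if flag & PAIRED:
            counts["paired"] += 1
            if flag & PROPER_PAIR:
                counts["proper"] += 1
    return counts


def percent(part, whole):
    """Percentage of part in whole; NaN when whole is zero (0/0 case)."""
    return float("nan") if whole == 0 else 100.0 * part / whole


# Three mapped single-end reads: flag 0 (forward) and flag 16 (reverse).
stats = summarize([0, 16, 0])
print(percent(stats["mapped"], stats["total"]))                # 100.0
print(math.isnan(percent(stats["proper"], stats["paired"])))   # True
```

Under this reading, a BAM file produced from single-end mapping would show 100% mapped and undefined pair statistics, without anything being wrong with the tool itself.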

Reload ".loc" files without restarting the Galaxy server?
by Luobin Yang 17 May '11

Hi all,

I am wondering if it's possible to reload a ".loc" file without restarting the Galaxy server?

Thanks,
Luobin

Galaxy's recent problems
by Dave Clements 06 May '11

Hello all,

Galaxy has been ill recently due to disk space on the server filling up. In the short term we are working on freeing up space. In the longer term we are working on adding a quota system to limit how much space any one user can take, and we've also ordered a lot more disk.

In the meantime, *please take a look at your saved histories and datasets and delete anything you no longer need*. This will greatly help us get through the current rough patch.

In the longer term, you can also consider setting up your own local Galaxy instance, or running Galaxy on the cloud. See:

http://getgalaxy.org/
https://bitbucket.org/galaxy/galaxy-central/wiki/cloud

Thanks for your patience while we work this out,

Dave C and the Galaxy Team

--
http://galaxy.psu.edu/gcc2011/
http://getgalaxy.org
http://usegalaxy.org/

Indel work flow questions
by Mike Dufault 03 May '11

Hi All,

I have a question about NGS: Indel Analysis and SNP calling. Assuming I have loaded my paired-end reads, groomed them, and got all the way through to alignment with BWA, my question then becomes: do the indel analysis and the SNP analysis split in the workflow at this point?

For SNP analysis, it seems that I need to filter on SAM, convert SAM-to-BAM, etc. For indel analysis, it seems that I should use the BWA output, which is in SAM format. Are these two statements correct?

I also have a question regarding the input for indel analysis. Should I use the BWA output directly (which is in SAM format), or should I first "filter on SAM" and use that output (which is also in SAM format)? I have tried the indel analysis using both filtered and unfiltered input and I get very similar results. It seems to me that I should use the "filtered on SAM" output, where I can indicate that the reads are paired=Yes, proper pairs=Yes, unmapped=No.

Any thoughts or insights would be appreciated.

Thanks in advance,
Mike

RNA seq analysis
by puvan001＠umn.edu 03 May '11

I am new to Galaxy and I am not sure whether these topics were discussed earlier. I followed the steps up to Cufflinks and I did not have any problems. Thanks for the RNA-seq tutorial.

My questions are:

1. How do I find out the number of reads mapped against the reference genome after TopHat mapping?
2. I am aware that Cuffdiff is used to find the differences in expression. How do I combine replicates (3) of different treatments?

SP

Only first 5000 reads are displayed in this tile
by Jeremy Chien 02 May '11

Hi,

During the visualization of my mRNAseq data, some areas have a red line indicating that only the first 5000 reads are displayed. Within such a region, some areas have many reads, while in other areas, where exons exist, I don't see any reads. How do I interpret the data? If no reads are shown in the visualization, although there is a red line saying only the first 5000 reads are displayed, does the absence of reads corresponding to a particular exon mean it is not expressed?

Thanks,
Jeremy
