1 new changeset in galaxy-central:

http://bitbucket.org/galaxy/galaxy-central/changeset/9f4956c0d94a/
changeset: r5590:9f4956c0d94a
user: an...@grusha.bx.psu.edu
date: 2011-05-20 19:26:45
summary: Fixes to picard tool intefaces and help sections. Disabled java heap selection by the user (set to 4G default, which was changed in picard_wrapper.py).
affected #: 13 files (7.0 KB)

--- a/tools/picard/picard_AddOrReplaceReadGroups.xml Fri May 20 06:17:35 2011 +0100
+++ b/tools/picard/picard_AddOrReplaceReadGroups.xml Fri May 20 13:26:45 2011 -0400
@@ -18,13 +18,13 @@
 -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/AddOrReplaceReadGroups.jar"
 </command><inputs>
- <param format="bam,sam" name="inputFile" type="data" label="Input: sam or bam format short read data in your current history"
- help="If the select list is empty, you need to upload or import some aligned short read data from a shared library" />
- <param name="rglb" value="" type="text" label="Read group library" />
- <param name="rgpl" value="" type="text" label="Read group platform" help="illumina, solid, etc." />
+ <param format="bam,sam" name="inputFile" type="data" label="SAM/BAM dataset to add or replace read groups in"
+ help="If empty, upload or import a SAM/BAM dataset." />
+ <param name="rgid" value="1" type="text" label="Read group ID (ID tag)" help="The most important read group tag. Galaxy will use a value of '1' if nothing provided." />
+ <param name="rgsm" value="" type="text" label="Read group sample name (SM tag)" />
+ <param name="rglb" value="" type="text" label="Read group library (LB tag)" />
+ <param name="rgpl" value="" type="text" label="Read group platform (PL tag)" help="illumina, solid, 454, pacbio, helicos" />
 <param name="rgpu" value="" type="text" label="Read group platform unit" help="like run barcode, etc."
/>
- <param name="rgsm" value="" type="text" label="Read group sample name" />
- <param name="rgid" value="1" type="text" label="Read group ID" help="Picard will use a value of '1' if nothing provided" />
 <conditional name="readGroupOpts"><param name="rgOpts" type="select" label="Specify additional (optional) arguments" help="Allows you to set RGCN and RGDS."><option value="preSet">Use pre-set defaults</option>
@@ -98,7 +98,44 @@
 **Purpose**
-Add or Replace Read Groups in an input bam or sam file.
+Add or Replace Read Groups in an input BAM or SAM file.
+
+**Read Groups are Important!**
+
+Many downstream analysis tools (such as GATK, for example) require BAM datasets to contain read groups. Even if you are not going to use GATK, setting read groups correctly from the start will simplify your life greatly. Below we provide an explanation of read groups fields taken from GATK FAQ webpage:
+
+.. csv-table::
+ :header-rows: 1
+
+ Tag,Importance,Definition,Meaning
+ "ID","Required","Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read group IDs may be modified when merging SAM files in order to handle collisions.","Ideally, this should be a globally unique identify across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the RG:Z field, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration (a GATK component) -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model."
+ "SM","Sample. Use pool name where a pool is being sequenced.","Required.
As important as ID.","The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper (a GATK component)."
+ "PL","Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.","Important. Not currently used in the GATK, but was in the past, and may return. The only way to known the sequencing technology used to generate the sequencing data","It's a good idea to use this field."
+ "LB","DNA preparation library identify","Essential for MarkDuplicates","MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes."
+
+**Example of Read Group usage**
+
+Support we have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an illumina hiseq, requiring 3 x 2 x 2 = 12 lanes of data.
When the data come off the sequencer, we would create 12 BAM files, with the following @RG fields in the header::
+
+ Dad's data:
+ @RG ID:FLOWCELL1.LANE1 PL:illumina LB:LIB-DAD-1 SM:DAD PI:200
+ @RG ID:FLOWCELL1.LANE2 PL:illumina LB:LIB-DAD-1 SM:DAD PI:200
+ @RG ID:FLOWCELL1.LANE3 PL:illumina LB:LIB-DAD-2 SM:DAD PI:400
+ @RG ID:FLOWCELL1.LANE4 PL:illumina LB:LIB-DAD-2 SM:DAD PI:400
+
+ Mom's data:
+ @RG ID:FLOWCELL1.LANE5 PL:illumina LB:LIB-MOM-1 SM:MOM PI:200
+ @RG ID:FLOWCELL1.LANE6 PL:illumina LB:LIB-MOM-1 SM:MOM PI:200
+ @RG ID:FLOWCELL1.LANE7 PL:illumina LB:LIB-MOM-2 SM:MOM PI:400
+ @RG ID:FLOWCELL1.LANE8 PL:illumina LB:LIB-MOM-2 SM:MOM PI:400
+
+ Kid's data:
+ @RG ID:FLOWCELL2.LANE1 PL:illumina LB:LIB-KID-1 SM:KID PI:200
+ @RG ID:FLOWCELL2.LANE2 PL:illumina LB:LIB-KID-1 SM:KID PI:200
+ @RG ID:FLOWCELL2.LANE3 PL:illumina LB:LIB-KID-2 SM:KID PI:400
+ @RG ID:FLOWCELL2.LANE4 PL:illumina LB:LIB-KID-2 SM:KID PI:400
+
+Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).
 **Picard documentation**
@@ -143,12 +180,12 @@
 .. class:: warningmark
-**Warning on Sam quality**
+**Warning on SAM/BAM quality**
-Unfortunately some packages seem perfectly capable of producing sam and bam files
-that Picard will be picky about otherwise. Galaxy deals with this by using the lenient
-flag, which allows reads to be discarded if they're empty or don't map. This appears
-to be the only way to deal with sam that cannot be parsed.
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
 </help>
--- a/tools/picard/picard_BamIndexStats.xml Fri May 20 06:17:35 2011 +0100
+++ b/tools/picard/picard_BamIndexStats.xml Fri May 20 13:26:45 2011 -0400
@@ -9,8 +9,8 @@
 -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/BamIndexStats.jar"
 </command><inputs>
- <param format="bam" name="input_file" type="data" label="Input: sam or bam format short read data in your current history"
- help="If the select list is empty, you need to upload or import some aligned short read data from a shared library" />
+ <param format="bam" name="input_file" type="data" label="BAM dataset to generate statistics for"
+ help="If empty, upload or import a BAM dataset" />
 </inputs><outputs><data format="html" name="htmlfile" label="${tool.name}_on_${on_string}.html" />
@@ -39,7 +39,7 @@
 **Purpose**
-Generate Bam Index Stats for a provided bam file.
+Generate Bam Index Stats for a provided BAM file.
 **Picard documentation**
@@ -53,26 +53,25 @@
 **Inputs and outputs**
-The only input is the bam file you wish to obtain statistics for, which is required.
-Note that it must be coordinate-sorted. Galaxy currently coordinate-sorts all bam files.
+The only input is the BAM file you wish to obtain statistics for, which is required.
+Note that it must be coordinate-sorted. Galaxy currently coordinate-sorts all BAM files.
 This tool outputs an HTML file that contains links to the actual metrics results, as well as a log file with info on the exact command run.
 .. class:: warningmark
-**Warning on Sam quality**
+**Warning on SAM/BAM quality**
-Unfortunately some packages seem perfectly capable of producing sam and bam files
-that Picard will be picky about otherwise. Galaxy deals with this by using the lenient
-flag, which allows reads to be discarded if they're empty or don't map. This appears
-to be the only way to deal with sam that cannot be parsed.
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications.
Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
 ------
 **Example**
-Given a bam file created from the following::
+Given a BAM file created from the following::
 @HD VN:1.0 SO:coordinate
 @SQ SN:chr1 LN:101
--- a/tools/picard/picard_MarkDuplicates.xml Fri May 20 06:17:35 2011 +0100
+++ b/tools/picard/picard_MarkDuplicates.xml Fri May 20 13:26:45 2011 -0400
@@ -24,7 +24,7 @@
 --picard-cmd="MarkDuplicates"
 </command><inputs>
- <param format="bam,sam" name="input_file" type="data" label="The sam- or bam-format short read data in your current history"
+ <param format="bam,sam" name="input_file" type="data" label="SAM/BAM dataset to mark duplicates in"
 help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/><param name="remDups" type="boolean" label="Remove duplicates from output file" truevalue="true" falsevalue="false" checked="False" help="If true do not write duplicates to the output file instead of writing them with appropriate flags set" />
--- a/tools/picard/picard_ReorderSam.xml Fri May 20 06:17:35 2011 +0100
+++ b/tools/picard/picard_ReorderSam.xml Fri May 20 13:26:45 2011 -0400
@@ -1,4 +1,4 @@
-<tool name="Reorder SAM" id="picard_ReorderSam" version="0.3.0">
+<tool name="Reorder SAM/BAM" id="picard_ReorderSam" version="0.3.0">
 <requirements><requirement type="package">picard</requirement></requirements><command interpreter="python">
 picard_wrapper.py
@@ -18,10 +18,10 @@
 -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/ReorderSam.jar"
 </command><inputs>
- <param format="bam,sam" name="inputFile" type="data" label="The sam or bam file in your current history whose header you want to replace"
- help="If the select list is empty, you need to upload or import some aligned short read data from a shared library" />
+ <param format="bam,sam"
name="inputFile" type="data" label="SAM/BAM dataset to be reordered"
+ help="If empty, upload or import a SAM/BAM dataset." />
 <conditional name="source">
- <param name="indexSource" type="select" label="Choose the source for the reference list">
+ <param name="indexSource" type="select" label="Select Reference Genome" help="This tool will re-order SAM/BAM in the same order as reference selected below.">
 <option value="built-in">Locally cached</option><option value="history">History</option></param>
@@ -39,7 +39,7 @@
 </conditional><param name="allowIncDictConcord" type="boolean" checked="False" truevalue="true" falsevalue="false" label="Allow incomplete dict concordance?" help="Allows a partial overlap of the BAM contigs with the new reference sequence contigs." /><param name="allowContigLenDiscord" type="boolean" checked="False" truevalue="true" falsevalue="false" label="Allow contig length discordance?" help="This is dangerous--don't check it unless you know exactly what you're doing!" />
- <param name="outputFormat" type="boolean" checked="True" truevalue="bam" falsevalue="sam" label="Output bam instead of sam" help="Uncheck for sam output" />
+ <param name="outputFormat" type="boolean" checked="True" truevalue="bam" falsevalue="sam" label="Output BAM instead of SAM" help="Uncheck for SAM output" />
 </inputs><outputs><data name="outFile" format="bam" label="${tool.name} on ${on_string}: reordered ${outputFormat}">
@@ -106,10 +106,10 @@
 **Purpose**
-Reorder Sam to match contig ordering in a particular reference file. Note that this is
+Reorder SAM/BAM to match contig ordering in a particular reference file. Note that this is
 not the same as sorting as done by the SortSam tool, which sorts by either coordinate values or query name. The ordering in ReorderSam is based on exact name matching of
-contigs. Reads that are mapped to a contig that is not in the new reference file are
+contigs/chromosomes.
Reads that are mapped to a contig that is not in the new reference file are
 not included in the output.
 **Picard documentation**
@@ -142,12 +142,11 @@
 .. class:: warningmark
-**Warning on Sam quality**
+**Warning on SAM/BAM quality**
-Unfortunately some packages seem perfectly capable of producing sam and bam files
-that Picard will be picky about otherwise. Galaxy deals with this by using the lenient
-flag, which allows reads to be discarded if they're empty or don't map. This appears
-to be the only way to deal with sam that cannot be parsed.
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
 </help>
--- a/tools/picard/picard_ReplaceSamHeader.xml Fri May 20 06:17:35 2011 +0100
+++ b/tools/picard/picard_ReplaceSamHeader.xml Fri May 20 13:26:45 2011 -0400
@@ -1,4 +1,4 @@
-<tool name="Replace Sam Header" id="picard_ReplaceSamHeader" version="0.2.0">
+<tool name="Replace SAM/BAM Header" id="picard_ReplaceSamHeader" version="0.2.0">
 <requirements><requirement type="package">picard</requirement></requirements><command interpreter="python">
 picard_wrapper.py
@@ -10,11 +10,11 @@
 --tmpdir "${__new_file_path__}"
 </command><inputs>
- <param format="bam,sam" name="inputFile" type="data" label="Input: sam or bam format short read data in your current history whose header will be replaced"
- help="If the select list is empty, you need to upload or import some aligned short read data from a shared library" />
- <param format="bam,sam" name="headerFile" type="data" label="sam or bam file from which header will be read"
- help="If the select list is empty, you need to upload or import some aligned short read data from a shared library" />
- <param name="outputFormat" type="boolean" checked="True"
truevalue="bam" falsevalue="sam" label="Output bam instead of sam" help="Uncheck for sam output" />
+ <param format="bam,sam" name="inputFile" type="data" label="SAM/BAM dataset to replace header in (TARGET)"
+ help="If empty, upload or import a SAM/BAM dataset." />
+ <param format="bam,sam" name="headerFile" type="data" label="SAM/BAM to reader header from (SOURCE)"
+ help="If empty, upload or import a SAM/BAM dataset." />
+ <param name="outputFormat" type="boolean" checked="True" truevalue="bam" falsevalue="sam" label="Output BAM instead of SAM" help="Uncheck for SAM output" />
 </inputs><outputs><data name="outFile" format="bam" label="${tool.name} on ${on_string}: ${outputFormat} with replaced header">
@@ -91,12 +91,12 @@
 .. class:: warningmark
-**Warning on Sam quality**
+**Warning on SAM/BAM quality**
-Unfortunately some packages seem perfectly capable of producing sam and bam files
-that Picard will be picky about otherwise. Galaxy deals with this by using the lenient
-flag, which allows reads to be discarded if they're empty or don't map. This appears
-to be the only way to deal with sam that cannot be parsed.
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
 </help>
--- a/tools/picard/picard_wrapper.py Fri May 20 06:17:35 2011 +0100
+++ b/tools/picard/picard_wrapper.py Fri May 20 13:26:45 2011 -0400
@@ -361,7 +361,7 @@
 op.add_option('-n', '--title', default="Pick a Picard Tool")
 op.add_option('-t', '--htmlout', default=None)
 op.add_option('-d', '--outdir', default=None)
- op.add_option('-x', '--maxjheap', default='2g')
+ op.add_option('-x', '--maxjheap', default='4g')
 op.add_option('-b', '--bisulphite', default='false')
 op.add_option('-s', '--sortorder', default='query')
 op.add_option('','--tmpdir', default='/tmp')
--- a/tools/picard/rgPicardASMetrics.xml Fri May 20 06:17:35 2011 +0100
+++ b/tools/picard/rgPicardASMetrics.xml Fri May 20 13:26:45 2011 -0400
@@ -1,4 +1,4 @@
-<tool name="Sam/bam Alignment Summary Metrics" id="PicardASMetrics" version="0.03">
+<tool name="SAM/BAM Alignment Summary Metrics" id="PicardASMetrics" version="0.03">
 <command interpreter="python">
 picard_wrapper.py -i "$input_file" -d "$html_file.files_path" -t "$html_file" --assumesorted "$sorted" -b "$bisulphite" --adaptors "$adaptors" --maxinsert "$maxinsert" -n "$out_prefix"
@@ -11,21 +11,20 @@
 </command><requirements><requirement type="package">picard</requirement></requirements><inputs>
- <param format="sam,bam" name="input_file" type="data" label="Input: sam or bam format short read data in your current history"
- help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/>
+ <param format="sam,bam" name="input_file" type="data" label="SAM/BAM dataset to generate statistics for"
+ help="If empty, upload or import a SAM/BAM dataset."/>
 <param name="out_prefix" value="Picard Alignment Summary Metrics" type="text"
- label="Title for the output file - use this remind you what the job was for" size="80" />
+ label="Title for the output file" help="Use this remind you what the job was for."
size="80" />
 <conditional name="genomeSource">
- <param name="refGenomeSource" type="select" help="This tool needs a reference genome. Default is usually right"
- label="Align to which reference genome? - default, built-in or choose from current history">
- <option value="default" selected="true">Default - use the input data genome/build</option>
+ <param name="refGenomeSource" type="select" label="Select Reference Genome">
+ <option value="default" selected="true">Use the assigned data genome/build</option>
 <option value="indexed">Select a different built-in genome</option><option value="history">Use a genome (fasta format) from my history</option></param><when value="default">
- <param name="index" type="select" label="Select a default reference genome">
+ <param name="index" type="select" label="Check the assigned reference genome" help="Galaxy thinks that the reads in you dataset were aligned against this reference. If this is not correct, use the 'Select a build-in reference genome' option of the 'Select Reference Genome' dropdown to select approprtiate Reference.">
 <options from_data_table="all_fasta"><filter type="data_meta" ref="input_file" key="dbkey" column="dbkey" multiple="True" separator="," /><validator type="no_options" message="No reference build available for selected input" />
@@ -33,18 +32,18 @@
 </param></when><when value="indexed">
- <param name="index" type="select" label="Select a built-in reference genome" >
+ <param name="index" type="select" label="Select a built-in reference genome" help="This list contains genomes cached at this Galaxy instance.
If your genome of interest is not present here request it by using 'Help' link at the top of Galaxy interface or use the 'Use a genome (fasta format) from my history' option of the 'Select Reference Genome' dropdown.">
 <options from_data_table="all_fasta"></options></param></when><when value="history">
- <param name="ownFile" type="data" format="fasta" metadata_name="dbkey" label="Select a reference genome from history" />
+ <param name="ownFile" type="data" format="fasta" metadata_name="dbkey" label="Select a reference genome from history" help="This option works best for relatively small genomes. If you are working with large human-sized genomes, send request to Galaxy team for adding your reference to this Galaxy instance by using 'Help' link at the top of Galaxy interface."/>
 </when></conditional><param name="sorted" type="boolean" label="Assume the input file is already sorted" checked="true" truevalue="true" falsevalue="false"/><param name="bisulphite" type="boolean" label="Input file contains Bisulphite sequenced reads" checked="false" falsevalue="false" truevalue="true" />
- <param name="adaptors" value="" type="text" area="true" label="Adapter sequences - one per line if multiple" size="5x120" />
+ <param name="adaptors" value="" type="text" area="true" label="Adapter sequences" help="One per line if multiple" size="5x120" />
 <param name="maxinsert" value="100000" type="integer" label="Larger paired end reads and inter-chromosomal pairs considered chimeric " size="20" /></inputs><outputs>
@@ -94,12 +93,17 @@
 **Syntax**
-- **Input** is sam/bam format aligned short read data in your current history
-- **Title** is the title to use for all output files from this job - use it for high level metadata
-- **Refseq** is the sequence you want to interogate - eg hg19
-- **Assume Sorted** saves sorting time - but only if true!
-- **Bisulphite data** see Picard documentation http://picard.sourceforge.net/command-line-overview.shtml#CollectAlignmentSu...
-- **Maximum acceptable insertion length** See Picard documentation at http://picard.sourceforge.net/command-line-overview.shtml#CollectAlignmentSu...
+- **Input** - SAM/BAM format aligned short read data in your current history
+- **Title** - the title to use for all output files from this job - use it for high level metadata
+- **Reference Genome** - Galaxy (and Picard) needs to know which genomic reference was used to generate alignemnts within the input SAM/BAM dataset. Here you have three choices:
+
+ - *Assigned data genome/build* - a genome specified for this dataset. If you your SAM/BAM dataset has an assigned reference genome it will be displayed below this dropdown. If it does not -> use one of the following two options.
+ - *Select a different built-in genome* - this option will list all reference genomes presently cached at this instance of Galaxy.
+ - *Select a reference genome from history* - alternatively you can upload your own version of reference genome into your history and use it with this option. This is however not advisable with large human-sized genomes. If your genome is large contact Galaxy team using "Help" link at the top of the interface and provide exact details on where we can download sequences you would like to use as the refenece. We will then install them as a part of locally cached genomic references.
+
+- **Assume Sorted** - saves sorting time - but only if true!
+- **Bisulphite data** - see Picard documentation http://picard.sourceforge.net/command-line-overview.shtml#CollectAlignmentSu...
+- **Maximum acceptable insertion length** - see Picard documentation at http://picard.sourceforge.net/command-line-overview.shtml#CollectAlignmentSu...
 -----
@@ -109,7 +113,7 @@
 The Picard documentation (reformatted for Galaxy) says:
-.. csv-table:: ASMDoc
+.. csv-table::
 :header-rows: 1
 Option,Description
@@ -122,46 +126,35 @@
 "IS_BISULFITE_SEQUENCED=Boolean","Whether the SAM or BAM file consists of bisulfite sequenced reads.
Default value: false. "
 "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created."
-The output produced by the tool has the following columns:
+The output produced by the tool has the following columns::
-#. CATEGORY: One of either UNPAIRED (for a fragment run), FIRST_OF_PAIR when metrics are for only the first read in a paired run, SECOND_OF_PAIR when the metrics are for only the second read in a paired run or PAIR when the metrics are aggregeted for both first and second reads in a pair.
-#. TOTAL_READS: The total number of reads including all PF and non-PF reads. When CATEGORY equals PAIR this value will be 2x the number of clusters.
-#. PF_READS: The number of PF reads where PF is defined as passing Illumina's filter.
-#. PCT_PF_READS: The percentage of reads that are PF (PF_READS / TOTAL_READS)
-#. PF_NOISE_READS: The number of PF reads that are marked as noise reads. A noise read is one which is composed entirey of A bases and/or N bases. These reads are marked as they are usually artifactual and are of no use in downstream analysis.
-#. PF_READS_ALIGNED: The number of PF reads that were aligned to the reference sequence. This includes reads that aligned with low quality (i.e. their alignments are ambiguous).
-#. PCT_PF_READS_ALIGNED: The percentage of PF reads that aligned to the reference sequence. PF_READS_ALIGNED / PF_READS
-#. PF_HQ_ALIGNED_READS: The number of PF reads that were aligned to the reference sequence with a mapping quality of Q20 or higher signifying that the aligner estimates a 1/100 (or smaller) chance that the alignment is wrong.
-#. PF_HQ_ALIGNED_BASES: The number of bases aligned to the reference sequence in reads that were mapped at high quality. Will usually approximate PF_HQ_ALIGNED_READS * READ_LENGTH but may differ when either mixed read lengths are present or many reads are aligned with gaps.
-#. PF_HQ_ALIGNED_Q20_BASES: The subest of PF_HQ_ALIGNED_BASES where the base call quality was Q20 or higher.
-#. PF_HQ_MEDIAN_MISMATCHES: The median number of mismatches versus the reference sequence in reads that were aligned to the reference at high quality (i.e. PF_HQ_ALIGNED READS).
-#. PF_HQ_ERROR_RATE: The percentage of bases that mismatch the reference in PF HQ aligned reads.
-#. MEAN_READ_LENGTH: The mean read length of the set of reads examined. When looking at the data for a single lane with equal length reads this number is just the read length. When looking at data for merged lanes with differing read lengths this is the mean read length of all reads.
-#. READS_ALIGNED_IN_PAIRS: The number of aligned reads who's mate pair was also aligned to the reference.
-#. PCT_READS_ALIGNED_IN_PAIRS: The percentage of reads who's mate pair was also aligned to the reference. READS_ALIGNED_IN_PAIRS / PF_READS_ALIGNED
-#. BAD_CYCLES: The number of instrument cycles in which 80% or more of base calls were no-calls.
-#. STRAND_BALANCE: The number of PF reads aligned to the positive strand of the genome divided by the number of PF reads aligned to the genome.
-#. PCT_CHIMERAS: The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes.
-#. PCT_ADAPTER: The percentage of PF reads that are unaligned and match to a known adapter sequence right from the start of the read.
+ 1. CATEGORY: One of either UNPAIRED (for a fragment run), FIRST_OF_PAIR when metrics are for only the first read in a paired run, SECOND_OF_PAIR when the metrics are for only the second read in a paired run or PAIR when the metrics are aggregeted for both first and second reads in a pair.
+ 2. TOTAL_READS: The total number of reads including all PF and non-PF reads. When CATEGORY equals PAIR this value will be 2x the number of clusters.
+ 3. PF_READS: The number of PF reads where PF is defined as passing Illumina's filter.
+ 4. PCT_PF_READS: The percentage of reads that are PF (PF_READS / TOTAL_READS)
+ 5.
PF_NOISE_READS: The number of PF reads that are marked as noise reads. A noise read is one which is composed entirey of A bases and/or N bases. These reads are marked as they are usually artifactual and are of no use in downstream analysis.
+ 6. PF_READS_ALIGNED: The number of PF reads that were aligned to the reference sequence. This includes reads that aligned with low quality (i.e. their alignments are ambiguous).
+ 7. PCT_PF_READS_ALIGNED: The percentage of PF reads that aligned to the reference sequence. PF_READS_ALIGNED / PF_READS
+ 8. PF_HQ_ALIGNED_READS: The number of PF reads that were aligned to the reference sequence with a mapping quality of Q20 or higher signifying that the aligner estimates a 1/100 (or smaller) chance that the alignment is wrong.
+ 9. PF_HQ_ALIGNED_BASES: The number of bases aligned to the reference sequence in reads that were mapped at high quality. Will usually approximate PF_HQ_ALIGNED_READS * READ_LENGTH but may differ when either mixed read lengths are present or many reads are aligned with gaps.
+ 10. PF_HQ_ALIGNED_Q20_BASES: The subest of PF_HQ_ALIGNED_BASES where the base call quality was Q20 or higher.
+ 11. PF_HQ_MEDIAN_MISMATCHES: The median number of mismatches versus the reference sequence in reads that were aligned to the reference at high quality (i.e. PF_HQ_ALIGNED READS).
+ 12. PF_HQ_ERROR_RATE: The percentage of bases that mismatch the reference in PF HQ aligned reads.
+ 13. MEAN_READ_LENGTH: The mean read length of the set of reads examined. When looking at the data for a single lane with equal length reads this number is just the read length. When looking at data for merged lanes with differing read lengths this is the mean read length of all reads.
+ 14. READS_ALIGNED_IN_PAIRS: The number of aligned reads who's mate pair was also aligned to the reference.
+ 15. PCT_READS_ALIGNED_IN_PAIRS: The percentage of reads who's mate pair was also aligned to the reference. READS_ALIGNED_IN_PAIRS / PF_READS_ALIGNED
+ 16.
BAD_CYCLES: The number of instrument cycles in which 80% or more of base calls were no-calls.
+ 17. STRAND_BALANCE: The number of PF reads aligned to the positive strand of the genome divided by the number of PF reads aligned to the genome.
+ 18. PCT_CHIMERAS: The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes.
+ 19. PCT_ADAPTER: The percentage of PF reads that are unaligned and match to a known adapter sequence right from the start of the read.
 .. class:: warningmark
-**Warning on Sam quality**
+**Warning on SAM/BAM quality**
-Unfortunately some packages seem perfectly capable of producing sam and bam files
-that Picard will be picky about otherwise. Galaxy deals with this by using the lenient
-flag, which allows reads to be discarded if they're empty or don't map. This appears
-to be the only way to deal with sam that cannot be parsed.
-
------
-
-.. class:: infomark
-
-**Typical tool invocation without Galaxy is on a command line - eg:**
-
-java -jar /share/shared/galaxy/tool-data/shared/jars/CollectAlignmentSummaryMetrics.jar REFERENCE_SEQUENCE="hg18.fasta" ASSUME_SORTED=true ADAPTER_SEQUENCE='' IS_BISULFITE_SEQUENCED=false INPUT=test.bam OUTPUT=picardASMetrics.txt VALIDATION_STRINGENCY=LENIENT
-
-Note that last parameter - your life will be far easier if you use it.
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
 </help>
--- a/tools/picard/rgPicardFixMate.xml Fri May 20 06:17:35 2011 +0100
+++ b/tools/picard/rgPicardFixMate.xml Fri May 20 13:26:45 2011 -0400
@@ -6,8 +6,8 @@
 </command><requirements><requirement type="package">picard</requirement></requirements><inputs>
- <param format="bam,sam" name="input_file" type="data" label="Input: sam or bam format short read data in your current history"
- help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/>
+ <param format="bam,sam" name="input_file" type="data" label="SAM/BAM dataset to fix"
+ help="If empty, upload or import a SAM/BAM dataset."/>
 <param name="sortOrder" type="select" help="If in doubt, leave as default and read Picard/Samtools documentation" label="Sort order"><option value="coordinate" selected ="true">Coordinate sort</option>
@@ -15,8 +15,8 @@
 <option value="unsorted">Unsorted - docs not clear if this means unchanged or not</option></param><param name="out_prefix" value="Fix Mate" type="text"
- label="Title for the output file - use this remind you what the job was for" size="80" />
- <param name="outputFormat" type="boolean" checked="True" truevalue="bam" falsevalue="sam" label="Output bam instead of sam" help="Uncheck for sam output" />
+ label="Title for the output file" help="Use this remind you what the job was for."
size="80" /> + <param name="outputFormat" type="boolean" checked="True" truevalue="bam" falsevalue="sam" label="Output BAM instead of SAM" help="Uncheck for SAM output" /></inputs><outputs><data format="bam" name="out_file" label="${tool.name} on ${on_string}: ${outputFormat} with fixed mates"> @@ -61,8 +61,8 @@ **Useful for paired data only** Likely won't do anything helpful for single end sequence data -Currently, Galaxy doesn't distinguish paired from single ended sam/bam so make sure -the data you choose are valid (paired end) sam or bam data - unless you trust this +Currently, Galaxy doesn't distinguish paired from single ended SAM/BAM so make sure +the data you choose are valid (paired end) SAM or BAM data - unless you trust this tool not to harm your data. ----- @@ -71,10 +71,10 @@ **Syntax** -- **Input** is a paired read sam/bam format aligned short read data in your current history -- **Sort order** can be used to adjust the ordering of reads -- **Title** is the title to use for all output files from this job - use it for high level metadata -- **Output Format** is either sam or compressed as bam +- **Input** - a paired read sam/bam format aligned short read data in your current history +- **Sort order** - can be used to adjust the ordering of reads +- **Title** - the title to use for all output files from this job - use it for high level metadata +- **Output Format** - either SAM or compressed as BAM ----- @@ -82,7 +82,7 @@ **Inputs, outputs, and parameters** -.. csv-table:: Fixmate +.. csv-table:: :header-rows: 1 @@ -94,12 +94,11 @@ .. class:: warningmark -**Warning on Sam quality** +**Warning on SAM/BAM quality** -Unfortunately some packages seem perfectly capable of producing sam and bam files -that Picard will be picky about otherwise. Galaxy deals with this by using the lenient -flag, which allows reads to be discarded if they're empty or don't map. This appears -to be the only way to deal with sam that cannot be parsed. 
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT** +flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears +to be the only way to deal with SAM/BAM that cannot be parsed. </help> --- a/tools/picard/rgPicardGCBiasMetrics.xml Fri May 20 06:17:35 2011 +0100 +++ b/tools/picard/rgPicardGCBiasMetrics.xml Fri May 20 13:26:45 2011 -0400 @@ -1,30 +1,28 @@ -<tool name="Sam/bam GC Bias Metrics" id="PicardGCBiasMetrics" version="0.01"> +<tool name="SAM/BAM GC Bias Metrics" id="PicardGCBiasMetrics" version="0.01"><command interpreter="python"> picard_wrapper.py -i "$input_file" -d "$html_file.files_path" -t "$html_file" --windowsize "$windowsize" --mingenomefrac "$mingenomefrac" -n "$out_prefix" --tmpdir "${__new_file_path__}" - -j ${GALAXY_DATA_INDEX_DIR}/shared/jars/CollectGcBiasMetrics.jar -x "$maxheap" + -j ${GALAXY_DATA_INDEX_DIR}/shared/jars/CollectGcBiasMetrics.jar #if $genomeSource.refGenomeSource == "history": --ref-file "$genomeSource.ownFile" #else: --ref "${ filter( lambda x: str( x[0] ) == str( $genomeSource.index ), $__app__.tool_data_tables[ 'all_fasta' ].get_fields() )[0][-1] }" #end if - </command><requirements><requirement type="package">picard</requirement></requirements><inputs> - <param format="sam,bam" name="input_file" type="data" label="Input: sam or bam format short read data in your current history" - help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/> + <param format="sam,bam" name="input_file" type="data" label="SAM/BAM dataset to generateGC bias metrics" + help="If empty, upload or import a SAM/BAM dataset."/><param name="out_prefix" value="Short Read GC Bias Metrics" type="text" - label="Title for the output file - use this remind you what the job was for" size="80" /> + label="Title for the output file" help="Use this remind 
you what the job was for." size="80" /><conditional name="genomeSource"> - <param name="refGenomeSource" type="select" help="This tool needs a reference genome. Default is usually right" - label="Align to which reference genome? - default, built-in or choose from current history"> - <option value="default" selected="true">Default - use the input data genome build</option> - <option value="indexed">Select a different built-in genome </option> - <option value="history">Use a genome (fasta format) from my history </option> + <param name="refGenomeSource" type="select" label="Select Reference Genome"> + <option value="default" selected="true">Use the assigned data genome/build</option> + <option value="indexed">Select a different built-in genome</option> + <option value="history">Use a genome (fasta format) from my history</option></param><when value="default"> - <param name="index" type="select" label="Select a default reference genome"> + <param name="index" type="select" label="Check the assigned reference genome" help="Galaxy thinks that the reads in you dataset were aligned against this reference. If this is not correct, use the 'Select a build-in reference genome' option of the 'Select Reference Genome' dropdown to select approprtiate Reference."><options from_data_table="all_fasta"><filter type="data_meta" ref="input_file" key="dbkey" column="dbkey" multiple="True" separator=","/><validator type="no_options" message="No reference build available for the selected input data" /> @@ -32,18 +30,23 @@ </param></when><when value="indexed"> - <param name="index" type="select" label="Select a built-in reference genome"> + <param name="index" type="select" label="Select a built-in reference genome" help="This list contains genomes cached at this Galaxy instance. 
If your genome of interest is not present here request it by using 'Help' link at the top of Galaxy interface or use the 'Use a genome (fasta format) from my history' option of the 'Select Reference Genome' dropdown."><options from_data_table="all_fasta"/></param></when><when value="history"> - <param name="ownFile" type="data" format="fasta" metadata_name="dbkey" label="Select a reference genome from history" /> + <param name="ownFile" type="data" format="fasta" metadata_name="dbkey" label="Select a reference genome from history" help="This option works best for relatively small genomes. If you are working with large human-sized genomes, send request to Galaxy team for adding your reference to this Galaxy instance by using 'Help' link at the top of Galaxy interface."/></when></conditional><param name="windowsize" type="integer" label="GC minimum window size" value="100" - help="The size of windows on the genome that are used to bin reads. Default value: 100"/> + help="The size of windows on the genome that are used to bin reads. Default value: 100."/><param name="mingenomefrac" value="0.00001" type="float" label="Minimum Genome Fraction" help="For summary metrics, exclude GC windows that include less than this fraction of the genome. Default value: 1.0E-5." /> + <!-- + + Users can be enabled to set Java heap size by uncommenting this option and adding '-x "$maxheap"' to the <command> tag. 
+ If commented out the heapsize defaults to the value specified within picard_wrapper.py + <param name="maxheap" type="select" help="If in doubt, choose 8G and read Picard documentation please" label="Java heap size"><option value="1G">1GB: very small data</option> @@ -52,6 +55,8 @@ <option value="8G" >8GB use if 4GB fails</option><option value="16G">16GB - try this if 8GB fails</option></param> + + --></inputs><outputs> @@ -90,9 +95,14 @@ **Syntax** -- **Input** is sam/bam format aligned short read data in your current history -- **Title** is the title to use for all output files from this job - use it for high level metadata -- **Refseq** is the sequence you want to interogate - eg hg19 - can be Galaxy built-in or a special one from your history +- **Input** - SAM/BAM format aligned short read data in your current history +- **Title** - the title to use for all output files from this job - use it for high level metadata +- **Reference Genome** - Galaxy (and Picard) needs to know which genomic reference was used to generate alignemnts within the input SAM/BAM dataset. Here you have three choices: + + - *Assigned data genome/build* - a genome specified for this dataset. If you your SAM/BAM dataset has an assigned reference genome it will be displayed below this dropdown. If it does not -> use one of the following two options. + - *Select a different built-in genome* - this option will list all reference genomes presently cached at this instance of Galaxy. + - *Select a reference genome from history* - alternatively you can upload your own version of reference genome into your history and use it with this option. This is however not advisable with large human-sized genomes. If your genome is large contact Galaxy team using "Help" link at the top of the interface and provide exact details on where we can download sequences you would like to use as the refenece. We will then install them as a part of locally cached genomic references. 
+ - **Window Size** see Picard documentation http://picard.sourceforge.net/command-line-overview.shtml#CollectGCBiasMetri... - **Minimum Genome Fraction** See Picard documentation at http://picard.sourceforge.net/command-line-overview.shtml#CollectGCBiasMetri... @@ -104,7 +114,7 @@ The Picard documentation (reformatted for Galaxy) says: -.. csv-table:: GC Bias Doc +.. csv-table:: :header-rows: 1 Option,Description @@ -117,33 +127,22 @@ "MINIMUM_GENOME_FRACTION=Double","For summary metrics, exclude GC windows that include less than this fraction of the genome. Default value: 1.0E-5." "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created. Default value: false." -The output produced by the tool has the following columns: +The output produced by the tool has the following columns:: -#. GC: The G+C content of the reference sequence represented by this bin. Values are from 0% to 100% -#. WINDOWS: The number of windows on the reference genome that have this G+C content. -#. READ_STARTS: The number of reads who's start position is at the start of a window of this GC. -#. MEAN_BASE_QUALITY: The mean quality (determined via the error rate) of all bases of all reads that are assigned to windows of this GC. -#. NORMALIZED_COVERAGE: The ration of "coverage" in this GC bin vs. the mean coverage of all GC bins. A number of 1 represents mean coverage, a number less than one represents lower than mean coverage (e.g. 0.5 means half as much coverage as average) while a number greater than one represents higher than mean coverage (e.g. 3.1 means this GC bin has 3.1 times more reads per window than average). -#. ERROR_BAR_WIDTH: The radius of error bars in this bin based on the number of observations made. For example if the normalized coverage is 0.75 and the error bar width is 0.1 then the error bars would be drawn from 0.65 to 0.85. + 1. GC: The G+C content of the reference sequence represented by this bin. Values are from 0% to 100% + 2. 
WINDOWS: The number of windows on the reference genome that have this G+C content. + 3. READ_STARTS: The number of reads who's start position is at the start of a window of this GC. + 4. MEAN_BASE_QUALITY: The mean quality (determined via the error rate) of all bases of all reads that are assigned to windows of this GC. + 5. NORMALIZED_COVERAGE: The ration of "coverage" in this GC bin vs. the mean coverage of all GC bins. A number of 1 represents mean coverage, a number less than one represents lower than mean coverage (e.g. 0.5 means half as much coverage as average) while a number greater than one represents higher than mean coverage (e.g. 3.1 means this GC bin has 3.1 times more reads per window than average). + 6. ERROR_BAR_WIDTH: The radius of error bars in this bin based on the number of observations made. For example if the normalized coverage is 0.75 and the error bar width is 0.1 then the error bars would be drawn from 0.65 to 0.85. .. class:: warningmark -**Warning on Sam quality** +**Warning on SAM/BAM quality** -Unfortunately some packages seem perfectly capable of producing sam and bam files -that Picard will be picky about otherwise. Galaxy deals with this by using the lenient -flag, which allows reads to be discarded if they're empty or don't map. This appears -to be the only way to deal with sam that cannot be parsed. - ------ - -.. class:: infomark - -**Typical tool invocation without Galaxy is on a command line - eg:** - -java -jar /share/shared/galaxy/tool-data/shared/jars/CollectGcBiasMetrics.jar REFERENCE_SEQUENCE="hg18.fasta" -MINIMUM_GENOME_FRACTION=0.00001 INPUT=test.bam OUTPUT=picardASMetrics.txt OUTPUT=test.txt CHART_OUTPUT=test.pdf -WINDOW_SIZE=100 VALIDATION_STRINGENCY=LENIENT +Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT** +flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. 
This appears +to be the only way to deal with SAM/BAM that cannot be parsed. </help></tool> --- a/tools/picard/rgPicardHsMetrics.xml Fri May 20 06:17:35 2011 +0100 +++ b/tools/picard/rgPicardHsMetrics.xml Fri May 20 13:26:45 2011 -0400 @@ -1,25 +1,32 @@ -<tool name="Sam/bam Hybrid Selection Metrics" id="PicardHsMetrics" version="0.01"> - <description>For (eg exome) targeted data</description> +<tool name="SAM/BAM Hybrid Selection Metrics" id="PicardHsMetrics" version="0.01"> + <description>for targeted resequencing data</description><command interpreter="python"> picard_wrapper.py -i "$input_file" -d "$html_file.files_path" -t "$html_file" --datatype "$input_file.ext" --baitbed "$bait_bed" --targetbed "$target_bed" -n "$out_prefix" --tmpdir "${__new_file_path__}" - -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/CalculateHsMetrics.jar" -x "$maxheap" + -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/CalculateHsMetrics.jar" </command><requirements><requirement type="package">picard</requirement></requirements><inputs> - <param format="sam,bam" name="input_file" type="data" label="Sam or bam format short read from your current history" /> - <param name="out_prefix" value="Picard HS Metrics" type="text" label="Title for the output file - to remind you what the job was for" size="80" /> - <param name="bait_bed" type="data" format="interval" label="Bait intervals: Sequences for bait in the design - ucsc BED" size="80" /> - <param name="target_bed" type="data" format="interval" label="Target intervals: Sequences for targets in the design - ucsc BED" size="80" /> + <param format="sam,bam" name="input_file" type="data" label="SAM/BAM dataset to generate statistics for" /> + <param name="out_prefix" value="Picard HS Metrics" type="text" label="Title for the output file" help="Use to remind you what the job was for." 
size="80" /> + <param name="bait_bed" type="data" format="interval" label="Bait intervals: Sequences for bait in the design" help="In UCSC BED format" size="80" /> + <param name="target_bed" type="data" format="interval" label="Target intervals: Sequences for targets in the design" help="In UCSC BED format" size="80" /> + <!-- + + Users can be enabled to set Java heap size by uncommenting this option and adding '-x "$maxheap"' to the <command> tag. + If commented out the heapsize defaults to the value specified within picard_wrapper.py + <param name="maxheap" type="select" help="If in doubt, try the default. If it fails with a complaint about java heap size, try increasing it please - larger jobs will require your own hardware." label="Java heap size"><option value="4G" selected = "true">4GB default </option><option value="8G" >8GB use if 4GB fails</option><option value="16G">16GB - try this if 8GB fails</option> - </param> + </param> + + --></inputs><outputs><data format="html" name="html_file" label="${out_prefix}.html" /> @@ -44,12 +51,9 @@ **Picard documentation** -This is a Galaxy wrapper for CalculateHsMetrics_, a part of the external package Picard-tools_, which is supported by the SAMTools_ project. +This is a Galaxy wrapper for CollectAlignmentSummaryMetrics, a part of the external package Picard-tools_. - .. _CalculateHsMetrics: http://picard.sourceforge.net/command-line-overview.shtml#CalculateHsMetrics - .. _Picard-tools: http://picard.sourceforge.net/index.shtml - .. _SAMTools: http://samtools.sourceforge.net/ - + .. _Picard-tools: http://www.google.com/search?q=picard+samtools ----- @@ -61,7 +65,7 @@ Calculates a set of Hybrid Selection specific metrics from an aligned SAM or BAM file. -.. csv-table:: HsDoc +.. csv-table:: :header-rows: 1 "Option", "Description" @@ -76,61 +80,52 @@ The set of metrics captured that are specific to a hybrid selection analysis. -Output Column Definitions +Output Column Definitions:: -#. 
BAIT_SET: The name of the bait set used in the hybrid selection. -#. GENOME_SIZE: The number of bases in the reference genome used for alignment. -#. BAIT_TERRITORY: The number of bases which have one or more baits on top of them. -#. TARGET_TERRITORY: The unique number of target bases in the experiment where target is usually exons etc. -#. BAIT_DESIGN_EFFICIENCY: Target terrirtoy / bait territory. 1 == perfectly efficient, 0.5 = half of baited bases are not target. -#. TOTAL_READS: The total number of reads in the SAM or BAM file examine. -#. PF_READS: The number of reads that pass the vendor's filter. -#. PF_UNIQUE_READS: The number of PF reads that are not marked as duplicates. -#. PCT_PF_READS: PF reads / total reads. The percent of reads passing filter. -#. PCT_PF_UQ_READS: PF Unique Reads / Total Reads. -#. PF_UQ_READS_ALIGNED: The number of PF unique reads that are aligned with mapping score > 0 to the reference genome. -#. PCT_PF_UQ_READS_ALIGNED: PF Reads Aligned / PF Reads. -#. PF_UQ_BASES_ALIGNED: The number of bases in the PF aligned reads that are mapped to a reference base. Accounts for clipping and gaps. -#. ON_BAIT_BASES: The number of PF aligned bases that mapped to a baited region of the genome. -#. NEAR_BAIT_BASES: The number of PF aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region. -#. OFF_BAIT_BASES: The number of PF aligned bases that mapped to neither on or near a bait. -#. ON_TARGET_BASES: The number of PF aligned bases that mapped to a targetted region of the genome. -#. PCT_SELECTED_BASES: On+Near Bait Bases / PF Bases Aligned. -#. PCT_OFF_BAIT: The percentage of aligned PF bases that mapped neither on or near a bait. -#. ON_BAIT_VS_SELECTED: The percentage of on+near bait bases that are on as opposed to near. -#. MEAN_BAIT_COVERAGE: The mean coverage of all baits in the experiment. -#. MEAN_TARGET_COVERAGE: The mean coverage of targets that recieved at least coverage depth = 2 at one base. 
-#. PCT_USABLE_BASES_ON_BAIT: The number of aligned, de-duped, on-bait bases out of the PF bases available. -#. PCT_USABLE_BASES_ON_TARGET: The number of aligned, de-duped, on-target bases out of the PF bases available. -#. FOLD_ENRICHMENT: The fold by which the baited region has been amplified above genomic background. -#. ZERO_CVG_TARGETS_PCT: The number of targets that did not reach coverage=2 over any base. -#. FOLD_80_BASE_PENALTY: The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets. -#. PCT_TARGET_BASES_2X: The percentage of ALL target bases acheiving 2X or greater coverage. -#. PCT_TARGET_BASES_10X: The percentage of ALL target bases acheiving 10X or greater coverage. -#. PCT_TARGET_BASES_20X: The percentage of ALL target bases acheiving 20X or greater coverage. -#. PCT_TARGET_BASES_30X: The percentage of ALL target bases acheiving 30X or greater coverage. -#. HS_LIBRARY_SIZE: The estimated number of unique molecules in the selected part of the library. -#. HS_PENALTY_10X: The "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 10X coverage I need to sequence until PF_ALIGNED_BASES = 10^6 * 10 * HS_PENALTY_10X. -#. HS_PENALTY_20X: The "hybrid selection penalty" incurred to get 80% of target bases to 20X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 20X coverage I need to sequence until PF_ALIGNED_BASES = 10^6 * 20 * HS_PENALTY_20X. -#. HS_PENALTY_30X: The "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 30X coverage I need to sequence until PF_ALIGNED_BASES = 10^6 * 30 * HS_PENALTY_30X. + 1. BAIT_SET: The name of the bait set used in the hybrid selection. + 2. 
GENOME_SIZE: The number of bases in the reference genome used for alignment. + 3. BAIT_TERRITORY: The number of bases which have one or more baits on top of them. + 4. TARGET_TERRITORY: The unique number of target bases in the experiment where target is usually exons etc. + 5. BAIT_DESIGN_EFFICIENCY: Target terrirtoy / bait territory. 1 == perfectly efficient, 0.5 = half of baited bases are not target. + 6. TOTAL_READS: The total number of reads in the SAM or BAM file examine. + 7. PF_READS: The number of reads that pass the vendor's filter. + 8. PF_UNIQUE_READS: The number of PF reads that are not marked as duplicates. + 9. PCT_PF_READS: PF reads / total reads. The percent of reads passing filter. + 10. PCT_PF_UQ_READS: PF Unique Reads / Total Reads. + 11. PF_UQ_READS_ALIGNED: The number of PF unique reads that are aligned with mapping score > 0 to the reference genome. + 12. PCT_PF_UQ_READS_ALIGNED: PF Reads Aligned / PF Reads. + 13. PF_UQ_BASES_ALIGNED: The number of bases in the PF aligned reads that are mapped to a reference base. Accounts for clipping and gaps. + 14. ON_BAIT_BASES: The number of PF aligned bases that mapped to a baited region of the genome. + 15. NEAR_BAIT_BASES: The number of PF aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region. + 16. OFF_BAIT_BASES: The number of PF aligned bases that mapped to neither on or near a bait. + 17. ON_TARGET_BASES: The number of PF aligned bases that mapped to a targetted region of the genome. + 18. PCT_SELECTED_BASES: On+Near Bait Bases / PF Bases Aligned. + 19. PCT_OFF_BAIT: The percentage of aligned PF bases that mapped neither on or near a bait. + 20. ON_BAIT_VS_SELECTED: The percentage of on+near bait bases that are on as opposed to near. + 21. MEAN_BAIT_COVERAGE: The mean coverage of all baits in the experiment. + 22. MEAN_TARGET_COVERAGE: The mean coverage of targets that recieved at least coverage depth = 2 at one base. + 23. 
PCT_USABLE_BASES_ON_BAIT: The number of aligned, de-duped, on-bait bases out of the PF bases available. + 24. PCT_USABLE_BASES_ON_TARGET: The number of aligned, de-duped, on-target bases out of the PF bases available. + 25. FOLD_ENRICHMENT: The fold by which the baited region has been amplified above genomic background. + 26. ZERO_CVG_TARGETS_PCT: The number of targets that did not reach coverage=2 over any base. + 27. FOLD_80_BASE_PENALTY: The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets. + 28. PCT_TARGET_BASES_2X: The percentage of ALL target bases acheiving 2X or greater coverage. + 29. PCT_TARGET_BASES_10X: The percentage of ALL target bases acheiving 10X or greater coverage. + 30. PCT_TARGET_BASES_20X: The percentage of ALL target bases acheiving 20X or greater coverage. + 31. PCT_TARGET_BASES_30X: The percentage of ALL target bases acheiving 30X or greater coverage. + 32. HS_LIBRARY_SIZE: The estimated number of unique molecules in the selected part of the library. + 33. HS_PENALTY_10X: The "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 10X coverage I need to sequence until PF_ALIGNED_BASES = 10^6 * 10 * HS_PENALTY_10X. + 34. HS_PENALTY_20X: The "hybrid selection penalty" incurred to get 80% of target bases to 20X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 20X coverage I need to sequence until PF_ALIGNED_BASES = 10^6 * 20 * HS_PENALTY_20X. + 35. HS_PENALTY_30X: The "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 30X coverage I need to sequence until PF_ALIGNED_BASES = 10^6 * 30 * HS_PENALTY_30X. .. 
class:: warningmark -**Warning on Sam quality** +**Warning on SAM/BAM quality** -Unfortunately some packages seem perfectly capable of producing sam and bam files -that Picard will be picky about otherwise. Galaxy deals with this by using the lenient -flag, which allows reads to be discarded if they're empty or don't map. This appears -to be the only way to deal with sam that cannot be parsed. +Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT** +flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears +to be the only way to deal with SAM/BAM that cannot be parsed. ------ - -.. class:: infomark - -**Typical tool invocation without Galaxy is on a command line - eg:** - -java -jar /share/shared/galaxy/tool-data/shared/jars/CalculateHsMetrics.jar BAIT_INTERVALS=test.pic TARGET_INTERVALS=test.pic INPUT=test.bam -OUTPUT=picardHsMetrics.txt VALIDATION_STRINGENCY=LENIENT </help></tool> --- a/tools/picard/rgPicardInsertSize.xml Fri May 20 06:17:35 2011 +0100 +++ b/tools/picard/rgPicardInsertSize.xml Fri May 20 13:26:45 2011 -0400 @@ -7,15 +7,15 @@ -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/CollectInsertSizeMetrics.jar" -d "$html_file.files_path" -t "$html_file" </command><inputs> - <param format="bam,sam" name="input_file" type="data" label="Input: sam or bam format short read data in your current history" - help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/> + <param format="bam,sam" name="input_file" type="data" label="SAM/BAM dataset to generate statistics for" + help="If empty, upload or import a SAM/BAM dataset."/><param name="out_prefix" value="Insertion size metrics" type="text" - label="Title for the output file - use this remind you what the job was for" size="120" /> + label="Title for the output file" help="Use this remind you what the job was for" 
size="120" /><param name="tailLimit" value="10000" type="integer" label="Tail limit" size="5" help="When calculating mean and stdev stop when the bins in the tail of the distribution contain fewer than mode/TAIL_LIMIT items" /><param name="histWidth" value="0" type="integer" - label="Histogram Width" size="5" + label="Histogram width" size="5" help="Explicitly sets the histogram width, overriding the TAIL_LIMIT option - leave 0 to ignore" /><param name="minPct" value="0.01" type="float" label="Minimum percentage" size="5" @@ -64,7 +64,7 @@ Picard documentation says (reformatted for Galaxy): -.. csv-table:: Insert size metrics docs +.. csv-table:: :header-rows: 1 Option,Description @@ -79,12 +79,11 @@ .. class:: warningmark -**Warning on Sam quality** +**Warning on SAM/BAM quality** -Unfortunately some packages seem perfectly capable of producing sam and bam files -that Picard will be picky about otherwise. Galaxy deals with this by using the lenient -flag, which allows reads to be discarded if they're empty or don't map. This appears -to be the only way to deal with sam that cannot be parsed. +Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT** +flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears +to be the only way to deal with SAM/BAM that cannot be parsed. 
</help></tool> --- a/tools/picard/rgPicardLibComplexity.xml Fri May 20 06:17:35 2011 +0100 +++ b/tools/picard/rgPicardLibComplexity.xml Fri May 20 13:26:45 2011 -0400 @@ -5,12 +5,12 @@ -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/EstimateLibraryComplexity.jar" -d "$html_file.files_path" -t "$html_file" </command><inputs> - <param format="bam,sam" name="input_file" type="data" label="Input: sam or bam format short read data in your current history" - help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/> + <param format="bam,sam" name="input_file" type="data" label="SAM/BAM dataset" + help="If empty, upload or import a SAM/BAM dataset."/><param name="out_prefix" value="Library Complexity" type="text" - label="Title for the output file - use this remind you what the job was for" size="80" /> + label="Title for the output file" help="Use this remind you what the job was for." size="80" /><param name="minIDbases" value="5" type="integer" label="Minimum identical bases at starts of reads for grouping" size="5" - help="Total_reads / 4^max_id_bases reads will be compared at a time. Lower numbers = more accurate results and exponentially more time/ram" /> + help="Total_reads / 4^max_id_bases reads will be compared at a time. Lower numbers = more accurate results and exponentially more time/memory." /><param name="maxDiff" value="0.03" type="float" label="Maximum difference rate for identical reads" size="5" help="The maximum rate of differences between two reads to call them identical" /> @@ -84,7 +84,7 @@ Picard documentation says (reformatted for Galaxy): -.. csv-table:: Estimate complexity docs +.. csv-table:: :header-rows: 1 Option Description @@ -99,12 +99,11 @@ .. class:: warningmark -**Warning on Sam quality** +**Warning on SAM/BAM quality** -Unfortunately some packages seem perfectly capable of producing sam and bam files -that Picard will be picky about otherwise. 
Galaxy deals with this by using the lenient -flag, which allows reads to be discarded if they're empty or don't map. This appears -to be the only way to deal with sam that cannot be parsed. +Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT** +flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears +to be the only way to deal with SAM/BAM that cannot be parsed. .. class:: infomark --- a/tools/picard/rgPicardMarkDups.xml Fri May 20 06:17:35 2011 +0100 +++ b/tools/picard/rgPicardMarkDups.xml Fri May 20 13:26:45 2011 -0400 @@ -6,16 +6,16 @@ </command><requirements><requirement type="package">picard</requirement></requirements><inputs> - <param format="bam,sam" name="input_file" type="data" label="Input: sam or bam format short read data in your current history" - help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/> + <param format="bam,sam" name="input_file" type="data" label="SAM/BAM dataset to mark duplicates in" + help="If empty, upload or import a SAM/BAM dataset."/><param name="out_prefix" value="Dupes Marked" type="text" - label="Title for the output file - use this remind you what the job was for" size="80" /> + label="Title for the output file" help="Use this remind you what the job was for" size="80" /><param name="remDups" value="false" type="boolean" label="Remove duplicates from output file" truevalue="true" falsevalue="false" checked="yes" - help="If true do not write duplicates to the output file instead of writing them with appropriate flags set" /> + help="If true do not write duplicates to the output file instead of writing them with appropriate flags set." 
/><param name="assumeSorted" value="true" type="boolean" label="Assume reads are already ordered" truevalue="true" falsevalue="false" checked="yes" - help="If true assume input data are already sorted (most Galaxy sam/bam should be)" /> + help="If true assume input data are already sorted (most Galaxy SAM/BAM should be)." /><param name="readRegex" value="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" type="text" size="80" label="Regular expression that can be used to parse read names in the incoming SAM file" help="Names are parsed to extract: tile/region, x coordinate and y coordinate, to estimate optical duplication rate" > @@ -30,7 +30,7 @@ </param><param name="optDupeDist" value="100" type="integer" label="The maximum offset between two duplicate clusters in order to consider them optical duplicates." size="5" - help="e.g. 5-10 pixels. Later Illumina software versions multiply pixel values by 10, in which case 50-100" > + help="e.g. 5-10 pixels. Later Illumina software versions multiply pixel values by 10, in which case 50-100." ><validator type="in_range" message="Minimum optical dupe distance must be positive" min="0" /></param> @@ -68,7 +68,7 @@ **Purpose** -Marks all duplicate reads in a provided sam or bam file and either removes them or flags them. +Marks all duplicate reads in a provided SAM or BAM file and either removes them or flags them. **Picard documentation** @@ -100,27 +100,19 @@ .. class:: warningmark -**Warning on Sam quality** +**Warning on SAM/BAM quality** -Unfortunately some packages seem perfectly capable of producing sam and bam files -that Picard will be picky about otherwise. Galaxy deals with this by using the lenient -flag, which allows reads to be discarded if they're empty or don't map. This appears -to be the only way to deal with sam that cannot be parsed. - +Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. 
Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
 .. class:: infomark
 **Note on the Regular Expression**
- (from the Picard docs)
- This tool requires a valid regular expression to parse out the read names in the incoming SAM or BAM file.
- These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size.
- The regular expression should contain three capture groups for the three variables, in order.
- Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*.
+(from the Picard docs)
+This tool requires a valid regular expression to parse out the read names in the incoming SAM or BAM file. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. The regular expression should contain three capture groups for the three variables, in order. Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).
- Examines aligned records in the supplied SAM or BAM file to locate duplicate molecules.
- All records are then written to the output file with the duplicate records flagged unless the
- remove duplicates option is selected. In some cases you may want to do this, but please only do
- this if you really understand what you are doing.
+Examines aligned records in the supplied SAM or BAM file to locate duplicate molecules. All records are then written to the output file with the duplicate records flagged unless the remove duplicates option is selected. In some cases you may want to do this, but please only do this if you really understand what you are doing.
 </help>
 </tool>

Repository URL: https://bitbucket.org/galaxy/galaxy-central/

--
This is a commit notification from bitbucket.org. You are receiving this because you have the service enabled, addressing the recipient of this email.
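
Editorial illustration (not part of the changeset): the help text for rgPicardMarkDups.xml says the read-name regular expression must contain three capture groups, which the `readRegex` parameter describes as tile/region, x coordinate and y coordinate. The sketch below applies the default pattern from that parameter using Python's `re` module. It is only an approximation of what Picard does internally, since Picard runs on Java's regex engine. The function name `parse_read_name` and the sample read name are hypothetical.

```python
import re

# Default pattern copied from the readRegex parameter in rgPicardMarkDups.xml.
DEFAULT_READ_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"


def parse_read_name(name, pattern=DEFAULT_READ_REGEX):
    """Return (tile, x, y) as integers, or None if the name does not match.

    Hypothetical helper for illustration only; it is not Picard's code.
    """
    match = re.fullmatch(pattern, name)
    if match is None:
        return None
    # The three capture groups are taken in order: tile/region, x, y.
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


# Made-up read name in the shape the default pattern expects.
print(parse_read_name("HWUSI1:3:42:1234:5678#0/1"))
# A name that does not follow that shape yields None.
print(parse_read_name("read_without_coordinates"))
```

Read names that do not fit the default pattern (for example, an instrument name containing a hyphen) would not match here, which is presumably why the tool exposes the expression as a user-editable parameter.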