Re: [galaxy-dev] GATK Unified Genotyper
Hi Daniel,

I have been implementing the GATK Unified Genotyper and I was having some issues. Occasionally I would get an error that the Java virtual machine would not start. I got around that by adding the -Xmx3g parameter to the command line. I also added -nt 3 because otherwise it's really slow on human data.

    --stdout "${output_log}"
    #for $i, $input_bam in enumerate( $reference_source.input_bams ):
        -d "-I" "${input_bam.input_bam}" "${input_bam.input_bam.ext}" "gatk_input_${i}"
        -d "" "${input_bam.input_bam.metadata.bam_index}" "bam_index" "gatk_input_${i}" ##hardcode galaxy ext type as bam_index
    #end for
    -p 'java -Xmx3g
        -jar "${GALAXY_DATA_INDEX_DIR}/shared/jars/gatk/GenomeAnalysisTK.jar"
        -T "UnifiedGenotyper"
        -o "${output_vcf}"
        ##-o "out_vcf.txt"
        -et "NO_ET" ##ET no phone home
        ##-log "${output_log}" ##don't use this to log to file, instead directly capture stdout
        #if $reference_source.reference_source_selector != "history":
            -R "${reference_source.ref_file.fields.path}"
        #end if
        -nt 3
        --standard_min_confidence_threshold_for_calling "${standard_min_confidence_threshold_for_calling}"
        --standard_min_confidence_threshold_for_emitting "${standard_min_confidence_threshold_for_emitting}"

Ilya Chorny Ph.D.
Bioinformatics Scientist I
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Work: 858.202.4582
Email: ichorny@illumina.com
Website: www.illumina.com
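For readers following along, here is a rough Python sketch of how the java invocation in the template above is put together. The function name `build_unified_genotyper_cmd` and its parameters are hypothetical and are not part of Galaxy or gatk_wrapper.py; only the flags themselves (-Xmx, -jar, -T, -o, -et, -I, -R, -nt and the two threshold options) are taken from the template.

```python
import shlex


def build_unified_genotyper_cmd(jar, bams, output_vcf, call_conf, emit_conf,
                                reference=None, heap="3g", threads=3):
    """Return the java argument list for a UnifiedGenotyper run.

    Hypothetical helper: it only mirrors the flags used in the template.
    """
    cmd = ["java", f"-Xmx{heap}", "-jar", jar,
           "-T", "UnifiedGenotyper",
           "-o", output_vcf,
           "-et", "NO_ET"]
    for bam in bams:
        # One -I flag per input BAM, as in the template's #for loop.
        cmd += ["-I", bam]
    if reference is not None:
        # The template only passes -R for a non-history reference.
        cmd += ["-R", reference]
    cmd += ["-nt", str(threads),
            "--standard_min_confidence_threshold_for_calling", str(call_conf),
            "--standard_min_confidence_threshold_for_emitting", str(emit_conf)]
    return cmd


# Example with placeholder file names and placeholder thresholds.
example = build_unified_genotyper_cmd(
    "GenomeAnalysisTK.jar", ["a.bam", "b.bam"], "out.vcf",
    call_conf=30.0, emit_conf=30.0, reference="ref.fasta")
print(shlex.join(example))
```

Building the command as a list rather than one string keeps each argument separate, so file names with spaces do not need manual quoting.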
Hi Ilya,

Working out the best way to set the JVM's maximum memory can be quite a pain. However, the underlying gatk_wrapper.py script was updated last week in 5970:20215fcf6da7 to allow using either -Xmx or -XX:DefaultMaxRAMFraction. Each of the GATK tools is currently set to use a -XX:DefaultMaxRAMFraction of 2 (half of RAM). Future enhancements to the tool-running code will allow this to be set based upon, e.g., the number of jobs and the available RAM per node (no time frame for this).

It does look like num_threads/nt is now available for the UnifiedGenotyper. Many of the GATK tools list it as an available parameter but, when run, give an error saying it is not yet implemented (this can only be determined by trial and error or by reading the source code). I've updated the Unified Genotyper to use --num_threads 4. Thanks!

It might be a good idea to coordinate the changes you are looking into for GATK with the changes we are planning. Any other feedback would also be greatly appreciated.

Thanks for using Galaxy,

Dan

On Sep 7, 2011, at 1:25 PM, Chorny, Ilya wrote:
> Hi Daniel,
>
> I have been implementing the GATK unified genotyper and I was having some issues. Occasionally I would get an error that the Java virtual machine would not start. I got around that by adding the -Xmx3g parameter to the command line. I also added -nt 3 because otherwise it's really slow on human data.
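Dan's reply names two ways of capping JVM memory: a fixed heap (-Xmx) or a fraction of physical RAM (-XX:DefaultMaxRAMFraction). The Python sketch below shows how a wrapper might choose between them. `jvm_memory_option` and `approx_heap_bytes` are hypothetical helpers, not the actual gatk_wrapper.py code, and the `-XX:Name=value` spelling is the usual HotSpot form, assumed here rather than taken from the thread.

```python
def jvm_memory_option(max_heap=None, ram_fraction=2):
    """Return a single JVM memory flag.

    max_heap: explicit heap size such as "3g"; takes precedence if given.
    ram_fraction: divisor of physical RAM; 2 means half of RAM.
    """
    if max_heap is not None:
        return f"-Xmx{max_heap}"
    if ram_fraction < 1:
        raise ValueError("ram_fraction must be >= 1")
    # Assumed spelling of the flag named in the thread.
    return f"-XX:DefaultMaxRAMFraction={ram_fraction}"


def approx_heap_bytes(total_ram_bytes, ram_fraction=2):
    """Rough heap ceiling implied by a RAM fraction (integer division)."""
    return total_ram_bytes // ram_fraction


print(jvm_memory_option("3g"))   # fixed heap, as in Ilya's workaround
print(jvm_memory_option())       # fraction of RAM, as in Dan's default
```

A fixed heap is predictable but has to fit the smallest node, while a RAM fraction adapts to each node without accounting for other jobs running there.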
The newest version of the Unified Genotyper has a -glm option, which needs to be set to "both" in order to call indels.

BTW, any interest in writing a wrapper for the Depth of Coverage tool?

Thanks,

Ilya
Hi Ilya,

--genotype_likelihoods_model / -glm is available under the tool's advanced options; it can be set to both, snp or indel. This parameter might be moved out from under the advanced options heading and placed with the base options in the future to make it easier to access.

Thanks for the suggestions, and please let us know if you encounter any other usability concerns with these tools. The Depth of Coverage tool will likely be added, but there is no time frame yet for when it will be available.

Thanks for using Galaxy,

Dan

On Sep 9, 2011, at 6:59 PM, Chorny, Ilya wrote:
> The newest version of the Unified Genotyper has a -glm option which needs to be set to both to call indels.
>
> BTW, any interest in writing a wrapper for the Depth of Coverage tool?
>
> Thanks,
>
> Ilya
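The -glm values Dan lists (both, snp, indel) can be checked before they reach the command line. A minimal Python sketch follows. `glm_args` is a hypothetical helper, not part of the Galaxy wrapper, and passing the value in upper case is an assumption about what GATK expects rather than something stated in the thread.

```python
GLM_CHOICES = ("both", "snp", "indel")  # the three values named in the thread


def glm_args(model):
    """Return the -glm arguments for a genotype likelihoods model."""
    key = model.strip().lower()
    if key not in GLM_CHOICES:
        raise ValueError(f"unknown -glm value: {model!r}")
    # Assumption: GATK accepts the upper-case spelling (e.g. BOTH).
    return ["-glm", key.upper()]


print(glm_args("both"))
```

Rejecting unknown values up front gives a clearer error than letting the job fail inside the JVM.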
participants (2)
- Chorny, Ilya
- Daniel Blankenberg