Preprocessing gDNA Illumina paired-end data for mapping/SNP calling
by Timothy Brennan
This question is with regard to pre-processing whole genome resequencing data for mapping to a reference yeast strain.
I'm having trouble joining paired-end data. I have two files per sample (read1 and read2).
I've successfully uploaded my fastq.gz files into Galaxy using FTP. I have two fastq files per strain, one for each direction, labelled for example:
(for the left hand dir) 130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.1.fastq
(for the right hand dir) 130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.2.fastq
After grooming each file with FASTQ Groomer, I'm trying to join them into a single file, but 0% of the reads are joined. So I think the header or directory is not in the correct format. E.g., the raw groomed reads for the left hand and right hand look like:
(for the left hand dir) 130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.1.fastq
@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 1:N:0:CCGTCC
NGTATGGAAGACGTAGAGTGGATGAAAATTTTGTGAAAAAAAAAAGCTTATAGGAACAAAAACATCCTTACATCTTCGGGTATTTCTTCTAGGGTTGAAGT
+
!!!%%%%%)))))**(*(!$()(((***(***')(**********)'))%!!%&%$$$$$####$$$$$$"$!!""!!##!!!!$$%$""!!"#$#!!!!!
@HWI-ST1240:133:H0854ADXX:2:1101:5045:1994 1:N:0:CCGTCC
NCCAGACACAGTTAACGCAACCTGACATGCAACAGTTATCGGGTTCTTGTGGTTTTGCAGGCACTTGGACACCTGCTATTTTCTTCGTTCCGCCGCTAAGC
(for the right hand dir) 130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.2.fastq
@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 2:N:0:CCGTCC
GCATAGTTACTTTTTGATCACTAACAACGATATATTATCGTTGAACAATTTACTACGCAAAACAGTTCACGTGATGTACGTCAGATAATTCACTGAAGGTA
+
$$$''''')))))++++++(+++++++++++++++++*+*++*++++++++++*++++*+++++*))))))''''&'&'%&%%%%$%%%'&%%%%%$%%$!
@HWI-ST1240:133:H0854ADXX:2:1101:5045:1994 2:N:0:CCGTCC
ATGTATTATAAGCCCGAATCAGATACTCAAATTTGAAAAAAGATATCTTTCTCCTCCGACATGGCCGAACTCATTTACATAAATAGCATAAATTAAACAGA
According to the wiki, I think the fastq format should look something like this, with /1 and /2 corresponding to each paired file.
@61CC3AAXX100125:7:118:2538:5577/1
GACACCTTTAATGTCTGAAAAGAGACATTCACCATCTATTCTCTTGGAGGGCTACCACCTAAGAGCCTTCATCCCC
+
?>CADFEEEBEDIEHHIDGGGEEEEHFFGIGIIFFIIEFHIIIIHIIFFIIIDEIIGIIIEHFFFIIEHIFA@?==
@61CC3AAXX100125:7:1:17320:13701/1
CTCAGAAGACCCTGAGAACATGTGCCCAAGGTGGTCACAGTGCATCTTAGTTTTGTACATTTTAGGGAGATATGAG
+
?BCAAADBBGGHGIDDDGHFEIFIIIIFGEIFIIFIGIGEFIIGGIIHEFFHHHIHEIFGHHIEFIIEECE?>@89
Any suggestions on how to get the files in the correct format/header to be able to join them?
Last question, what is the tool to trim reads based on quality again?
Thanks very much gentle people
Tim
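
A minimal sketch of the header conversion asked about above, assuming the joiner pairs reads by a trailing /1 and /2 as the wiki example suggests (that requirement is taken from the post, not verified against any particular Galaxy version). The names `to_slash_header` and `convert_fastq` are hypothetical helpers, not Galaxy tools. Only every fourth line is treated as a header, because quality lines can also begin with "@".

```python
import re

# Matches a CASAVA 1.8+ style header such as
# "@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 1:N:0:CCGTCC"
CASAVA_HEADER = re.compile(r"^(@\S+) ([12]):[YN]:\d+:\S*$")


def to_slash_header(header_line):
    """Turn '@ID 1:N:0:INDEX' into '@ID/1'; leave other headers unchanged."""
    header_line = header_line.rstrip("\n")
    match = CASAVA_HEADER.match(header_line)
    if match is None:
        return header_line
    return f"{match.group(1)}/{match.group(2)}"


def convert_fastq(lines):
    """Rewrite the header (first line) of each 4-line FASTQ record."""
    converted = []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Line position, not a leading '@', identifies the header:
        # quality strings may start with '@' too.
        converted.append(to_slash_header(line) if i % 4 == 0 else line)
    return converted
```

Applied to both read files, the two mates of a pair would then share an identifier that differs only in the /1 or /2 suffix.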
9 years, 3 months
Uses for Galaxy
by Priya Bhatt
Hi Galaxy Users,
I'm very new to Galaxy and have read/watched MANY Galaxy tutorials, but I have a question for other users out there:
What are the top 10 ways you find Galaxy most useful for your analysis, especially for those of you who work with whole genome sequencing data?
Thanks for your help and insight, in advance!
Best,
Priya
9 years, 3 months
trouble using bcf tools.
by Patrick Leahy
Dear galaxy help, Devon
new user making good headway on galaxy cloud here at Case Western Reserve
university in Cleveland Ohio.
I have a small project. 10 human samples. I did a custom capture of exons
from 171 genes.
Total target size is ~ 0.5 Gb. PE reads from Illumina HighScan.
See attached screenshots. I uploaded one fastq file (1.2Gb).
I succeeded in doing FastQC, groomer, trimmer, alignment with Bowtie,
SAM to BAM, mpileup. Now, I want to convert to .vcf for downstream work.
I got stuck at bcftools (see second slide). It is as if bcftools doesn't
recognize the output from mpileup. Note how the option windows in bcftools
are all collapsed.
I used default parameters throughout the workflow.
Any ideas?
Patrick
--
Patrick Leahy, Ph.D.
*Asst. Professor, General Medical Sciences- Oncology*
Scientific Coordinator, Integrated Genomics Shared Resource (IGSR)
Scientific Managing Director Microarray Core Facility
Director, Laser Capture Microdissection Service
Case Western Reserve University
Room 3541 Wolstein Research Building
2103 Cornell Rd
Cleveland, OH 44106
t: 216 368 0761
f: 216 368 8919
http://cancer.case.edu/members/gms_onc/leahy.html
9 years, 3 months
problem with the joining of two interval datasets
by Milad Bastami
Hi everybody
I tried to join two interval datasets. The first has 1,500 regions and the second about 2 million regions. The job is still running (more than 4 hours) and it has affected system performance. I'm not sure whether this is normal. If not, I don't know whether it is a matter of Galaxy limitations or my PC's hardware configuration. Any help would be appreciated.
I am running a local Galaxy on bio-linux7 (win8 / bio-linux7 dual boot system)
System configuration:
CPU : core i7 9610 (6M)
RAM : 8 G
linux swap : 8 G
linux root : >700 G
Milad Bastamis,
Department of Medical Genetics
Shahid Beheshti's university of Medical Science
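
For a sense of scale, here is a rough sketch of an overlap join between a small and a large interval set. It is not the implementation behind Galaxy's Join tool, and it says nothing certain about why the job above is slow; it only illustrates that sorting the large set once and bounding each lookup avoids comparing every pair. It assumes half-open (chrom, start, end) tuples in BED-like coordinates, and the function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(intervals):
    """Group (chrom, start, end) tuples by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        index[chrom] = ([s for s, _ in ivs], ivs)
    return index


def overlaps(index, chrom, start, end):
    """Return indexed intervals on chrom that overlap half-open [start, end)."""
    if chrom not in index:
        return []
    starts, ivs = index[chrom]
    # Intervals from position `hi` onward start at or after `end`,
    # so they cannot overlap the query.
    hi = bisect_left(starts, end)
    return [(s, e) for s, e in ivs[:hi] if e > start]


def join(small, large):
    """Pair every interval in `small` with each overlapping one in `large`."""
    index = build_index(large)
    return [
        ((chrom, start, end), (chrom, hit_start, hit_end))
        for chrom, start, end in small
        for hit_start, hit_end in overlaps(index, chrom, start, end)
    ]
```

The binary search only bounds the scan from one side, so a proper interval tree would do better on long chromosomes; the sketch favours brevity.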
9 years, 3 months
Jobs waiting to run over a week
by Benard Lab
Hi,
I hope this message finds you well. I started a workflow in the public
galaxy main instance on April 14th (a week ago) and I am still waiting for
it to run.
This has never happened before in the past two months. Is there anything I
can do on my end or is it just an unusually long queue?
If it is a long queue, could you give me an estimate of when my jobs will
run?
Thank you. I look forward to your response.
Jim
9 years, 3 months
Re: [galaxy-user] Regarding a cuffdiff output
by Jennifer Hillman-Jackson
Hi Yona Kim,
On 4/18/13 9:54 PM, Yona Kim wrote:
> Dear Jennifer
>
> Thank you very much for your help for my analysis.
> I'm still stuck on getting the final data from Cuffdiff analysis.
> As you have mentioned, I've obtained the correct GTF file (mm9 gene
> annotation), and also made sure that the reference genome and GTF file
> are an exact match - they both are mm9.
Are you using the iGenomes version of the GTF file? With the attributes
Cuffdiff requires for generating all of the additional statistics? It
appears that this is the case, but I just wanted to double check. If not
using it, you can find a copy to load and use on the public server (if
this is where you are working) in Shared Data -> Data Libraries ->
iGenomes. Otherwise, it can be found at the Cufflinks web site.
These are the two attributes that are important to have, when available:
http://cufflinks.cbcb.umd.edu/manual.html#cuffdiff_input
>
> When I view the output of transcript differential expression testing
> (one of the outputs of cuffdiff) in excel, the names of the genes seem
> to be properly annotated according to their location on chromosome, but
> I have no values recorded for any of the calculations (I'm attaching
> this file just in case you want to take a look at it).
>
The results you are getting indicate the data coverage is sparse, which
aligns with your thoughts about this mapping not being as successful as
prior runs:
NOTEST and LOWDATA are explained here with advice about parameter tuning:
http://wiki.galaxyproject.org/Support#Tools_on_the_Main_server
follow links to Cufflinks FAQ to find:
http://cufflinks.cbcb.umd.edu/faq.html#notest
> Do you think that the problem might have originated from the fastq
> files themselves?
>
> I was also wondering about the reduction in the size of the files.
> Comparing with one of my other analyses in Galaxy, I realized that the
> size of the file was significantly reduced from 6.1GB (fastq groomer) to
> 1006.9KB (Tophat accepted hits), whereas in my other analysis, the size
> was reduced from 5.9GB (fastq groomer) only to 1.6 GB (Tophat accepted
> hits).
>
> Do you think an error might have occurred when Tophat was
> running on the groomed data, thus providing erroneous data to
> Cufflinks, and eventually to Cuffdiff?
This could be the source of the problem. Making sure that the data was
groomed correctly would be a good place to start. The comments from the
first run will note the detected input type (but there can be some
overlap), so also use the tool "FastQC" to help determine the proper
settings for "FASTQ Groomer". And if necessary, re-run from this step to
see if that improves the mapping.
http://wiki.galaxyproject.org/Support#Dataset_special_cases
See the second bullet under "FASTQ"
If your query data is short (less than around 40 bases), then tuning
Tophat could also improve mapping, see the tool's web page for advice
regarding mapping shorter sequences. Then test out a few different
parameter options to see what produces the best results for your
particular datasets/samples. There is a balance between being too
sensitive and too stringent - and this is a judgement call in most cases.
Trimming the reads may help if quality is an issue ("FastQC" will also
give information about this). The RNA-seq example tutorial has an
example of how to do basic QC:
https://main.g2.bx.psu.edu/u/jeremy/p/galaxy-rna-seq-analysis-exercise
Hopefully this helps to give some new options to test out that improve
the result!
Jen
Galaxy team
>
> Thank you very very much for your time and help
>
> Sincerely yours,
>
> Yona Kim
> Department of Genetics
> Rutgers University
>
>
>
>
> On Mon, Apr 8, 2013 at 4:54 PM, Jennifer Jackson <jen(a)bx.psu.edu> wrote:
>
> Hi Yona,
>
> Yes, the GTF file is most likely the problem due to it lacking
> certain attributes that Cuffdiff requires to perform these
> calculations. You will also want to double check that the reference
> genome and GTF file (where you source it next) are an exact match -
> both the genome build and the identifier format. If either are not a
> match, you will not get the expected or full results that Cuffdiff
> can produce.
>
> This wiki has some help;
> http://wiki.galaxyproject.org/Support#Interpreting_scientific_results
> See "Tools on the Main server: Example → RNA-seq analysis tools."
>
> The links to the Cufflinks web site explains the attributes that
> Cuffdiff is looking for, links to the iGenomes datasets available
> (best to use if your genome is represented), and a pointer to the
> tool's user group. Two iGenomes GTF files are also already available
> in Galaxy (hg19, mm9) in "Shared Data -> Data Libraries ->
> iGenomes". The link to our tutorial and FAQ has help about how the
> GTF files are used along with troubleshooting advice.
>
> Best,
>
> Jen
> Galaxy team
>
>
> On 4/3/13 8:28 AM, Yona Kim wrote:
>> Dear galaxy users
>>
>> Hello. I have a quick question about Cuffdiff analysis.
>> I have obtained two SRA files and converted them to fastq files
>> which were uploaded to Galaxy via FTP server. My analysis was
>> followed by Fastq groomer, Tophat, Cufflinks, Cuffcompare, and
>> eventually Cuffdiff. (Gene annotation was also downloaded from
>> UCSC table browser in GTF format) I've downloaded gene
>> differential expression testing, one of the output files of
>> Cuffdiff, and viewed it in excel sheet. However, I have only zeros
>> recorded for value_1, value_2, log2, test_stat and only ones
>> recorded for p_value and q_value.
>>
>> Is it likely that I might have obtained wrong gene annotation file
>> and caused this problem?
>>
>> Thank you
>>
>> Yona Kim
>> Department of Genetics
>> Rutgers University - New Brunswick Campus
--
Jennifer Hillman-Jackson
Galaxy Support and Training
http://galaxyproject.org
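
One quick way to see how much of a Cuffdiff result falls under the NOTEST and LOWDATA statuses discussed in this thread is to tally the status column of the downloaded differential expression file. This is a minimal sketch, assuming the file is tab-delimited with a header row containing a column named `status`; check the header of your own file before relying on it. `count_status` is a hypothetical helper name.

```python
import csv
from collections import Counter


def count_status(handle):
    """Tally the values in the 'status' column of a tab-delimited file."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Assumption: the header row includes a column literally named "status".
    return Counter(row["status"] for row in reader)


if __name__ == "__main__":
    import sys

    # Usage: python count_status.py gene_exp.diff
    with open(sys.argv[1], newline="") as handle:
        for status, n in count_status(handle).most_common():
            print(f"{status}\t{n}")
```

If nearly every row is NOTEST or LOWDATA, that points back at sparse coverage from the mapping step rather than at the annotation.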
9 years, 3 months
hello Newbie questions about galaxy and clustering
by leconte
Hello,
I have installed a Galaxy instance for testing purposes on an 8-CPU
server without any problem; everything runs perfectly.
I have written some Galaxy XML wrappers to understand how Galaxy calls
programs, and other little things.
Now I want to install a Galaxy server with a cluster configuration.
I have tested HTCondor before, but I have no idea how to combine Galaxy
and HTCondor.
Ideally, somebody would already have tested this kind of setup with Condor.
But I could settle for any other cluster type.
What I don't understand, in fact (maybe my worst problem), is where all
the parts of this puzzle fit.
Let me explain.
I know that I need Galaxy. I also know I need a cluster (in my case
HTCondor).
I understand I need something between these 2 parts ... I think
something like PBS or SGE, but it's not totally clear to me.
I searched the Galaxy mailing list archive ... but I found only 2 mails,
from 2008 :(
Has anybody set this up, or does anyone have a working configuration?
Greetings
PS: sorry for my poor English :)
9 years, 3 months
Fwd: [Gmod-announce] 2013 GMOD Summer School: Apply now!
by Dave Clements
Hello all,
The 2013 GMOD Summer School is now accepting applications. Galaxy will
again be included in the topics covered.
And Amelia says:
> Applications are competitive, so we encourage you to apply well before the
> deadline, June 10th.
So, if you are interested, act soon.
Dave C
---------- Forwarded message ----------
From: Amelia Ireland <amelia.ireland(a)gmod.org>
Date: Wed, Apr 17, 2013 at 11:15 AM
Subject: [Gmod-announce] 2013 GMOD Summer School: Apply now!
To: gmod-announce <gmod-announce(a)lists.sourceforge.net>,
gmod-devel(a)lists.sourceforge.net
We are now accepting applications for the 2013 GMOD Summer School, to be
held at NESCent, Durham, North Carolina from July 19th to 23rd. The GMOD
Summer School is the best way to learn how to install, configure, and use
popular GMOD tools, including GBrowse, JBrowse, Galaxy, MAKER, Tripal,
WebApollo, and Chado; courses are taught by the tool developers, and there
will be evening sessions for those who want to work on their own data or
troubleshoot issues with the developers.
More information and online application form:
http://gmod.org/wiki/2013_GMOD_Summer_School
Applications are competitive, so we encourage you to apply well before the
deadline, June 10th.
If you have any questions, please contact help(a)gmod.org and we will be
happy to answer them.
--
Amelia Ireland
GMOD Community Support
http://gmod.org || @gmodproject
--
http://galaxyproject.org/GCC2013
http://galaxyproject.org/
http://getgalaxy.org/
http://usegalaxy.org/
http://wiki.galaxyproject.org/
9 years, 3 months
fastq quality trimmer
by Tomaz Rijavec
Dear "Galaxy Team",
I have a question regarding the fastq quality trimmer (by sliding window)
tool. What is the script running behind it? Is it a modified version of the
fastx_quality_trimmer found in the fastx tools package? Is there a
standalone version of it that one can set up locally and run from the
command line?
Thanks for all the info you can provide.
best regards,
Tomaž
--
Lep pozdrav / Best Regards,
dr. Tomaž Rijavec
--------------------------------------------------------
Inštitut za fizikalno biologijo d.o.o. / Toplarniška ulica 19, 1000
Ljubljana / T: 01 587 54 70 / www.ifb.si
Institute of Physical Biology / Toplarniška ulica 19, SI-1000 Ljubljana,
Slovenia / T: +3861 587 54 70 / www.ifb.si
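
For readers unfamiliar with the general idea of sliding-window quality trimming, here is a small sketch. It is explicitly not the script behind the Galaxy tool and not the fastx tools code; it assumes Sanger (Phred+33) quality encoding, trims only from the 3' end, and uses hypothetical names and default values (`window=4`, `min_mean=20`) chosen for illustration.

```python
def phred33(qual):
    """Decode a Sanger/Phred+33 quality string into integer scores."""
    return [ord(c) - 33 for c in qual]


def trim_3prime(seq, qual, window=4, min_mean=20):
    """Drop bases from the 3' end until the last `window` bases
    have a mean quality of at least `min_mean`."""
    scores = phred33(qual)
    end = len(scores)
    while end > 0:
        win = scores[max(0, end - window):end]
        if sum(win) / len(win) >= min_mean:
            break
        end -= 1  # window still too poor: remove one more base
    return seq[:end], qual[:end]
```

With a window larger than one base, a few low-quality bases can survive at the end if their neighbours pull the mean above the threshold; a window of 1 trims strictly base by base.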
9 years, 3 months
Put an apache password on galaxy?
by Lee Katz
Hi, I am relatively new to Galaxy. I would like to set it up as an Apache
password-protected site. I have tried to do it, but I realize now that
Galaxy has its own built-in server, so setting up .htaccess is ineffective
because that server isn't reading .htaccess (or so I think).
More information: Apache works (test page loads, and I can change it), and
Galaxy works (I even made a new tool that is working). I have Ubuntu 12.
My basic, basic need is to set up a server-level password like what
.htaccess would provide. Could anyone help? Thank you.
--
Lee Katz, Ph.D.
9 years, 3 months