August 2010 - galaxy-user - lists.galaxyproject.org

Python error when running Bowtie for Illumina
by Weng Khong Lim 01 Feb '11

Hi all,

I'm new to next-gen sequencing, so please be gentle. I've just received a pair of Illumina FASTQ files from the sequencing facility and intend to map them to the hg19 reference genome. I first used the FASTQ Groomer utility to convert the reads into Sanger reads. However, when running Bowtie for Illumina on the resulting dataset under default settings, I received the following error:

An error occurred running this job: *Error aligning sequence. requested number of bytes is more than a Python string can hold*

Can someone help point out my mistake? My history is accessible at http://main.g2.bx.psu.edu/u/wengkhong_lim/h/chip-seq-pilot-batch

Appreciate the help!

Weng Khong, LIM
Department of Genetics
University of Cambridge
E-mail: wkl24(a)cam.ac.uk
Tel: +447503225832

Re: [galaxy-user] integrating Galaxy with a relational data warehouse?
by Yury Bukhman 01 Sep '10

Thank you, James, for your reply. I wonder if you could elaborate on why storing the bulk of the data in a relational database seems impractical, or point me to a document where this is discussed at more length.

Yury

On 08/31/10, James Taylor <james(a)jamestaylor.org> wrote:
> Hi Yury,
>
> > we are planning to build a data warehouse for a research center that utilizes multiple high-throughput experimental platforms, e.g. plate-based HTS assays, microarrays of several different types, ChIP-seq, RNA-seq. We have been thinking of managing the data in a relational database. Galaxy looks attractive to us for its workflow management and data provenance features, e.g. to keep track of how raw data are analyzed to produce normalized & summarized datasets and/or final sets of statistics such as p values. We wonder how amenable would Galaxy be to integration with a relational data store.
> >
> > One possible scenario might be to have Galaxy import a dataset from a relational database, run a workflow, then submit the results back to the database with the associated history or link thereto.
>
> This is certainly a reasonable possibility. You could have a Galaxy tool for submitting data to your database. I would imagine such a tool would produce a Galaxy dataset as output with whatever unique identifier is necessary to recover exactly that data from the database for another analysis.
>
> > Another possibility is to forgo the relational database altogether and do all our data management within Galaxy.
>
> I can only give you our experience from inside Galaxy. After initial analysis we made a decision to store all data in Galaxy as files on disk, with metadata (data about data, connections between datasets, workflows, et cetera) in a relational database. We feel this decision has worked well. For the scale of data we see, as well as the wide variety of different data types, a relational database did not, and still does not, seem practical to us.
>
> -- jt
>
> James Taylor
> Assistant Professor
> Department of Biology
> Department of Mathematics & Computer Science
> Emory University

--
Yury V. Bukhman, Ph.D.
Associate Scientist, Bioinformatics
Great Lakes Bioenergy Research Center
University of Wisconsin - Madison
445 Henry Mall, Rm. 513
Madison, WI 53706, USA
Phone: 608-890-2680
Fax: 608-890-2427
Email: ybukhman(a)glbrc.wisc.edu
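
The round trip James describes (submit results to a database, keep only an identifier in Galaxy, recover the same data later) can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory SQLite database stands in for the warehouse, and the table layout and the names `submit_results` and `fetch_results` are hypothetical, not part of Galaxy.

```python
import sqlite3
import uuid


def submit_results(conn, rows):
    """Store result rows under a fresh identifier and return that identifier."""
    submission_id = str(uuid.uuid4())
    # Hypothetical table layout: one row per (feature, p-value) result.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(submission_id TEXT, feature TEXT, p_value REAL)"
    )
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [(submission_id, feature, p_value) for feature, p_value in rows],
    )
    conn.commit()
    return submission_id


def fetch_results(conn, submission_id):
    """Recover exactly the rows that were stored under one identifier."""
    cursor = conn.execute(
        "SELECT feature, p_value FROM results "
        "WHERE submission_id = ? ORDER BY feature",
        (submission_id,),
    )
    return cursor.fetchall()


# An in-memory database stands in for the warehouse (an assumption).
conn = sqlite3.connect(":memory:")
sid = submit_results(conn, [("geneA", 0.01), ("geneB", 0.2)])
# A submitting tool could write `sid` into its output dataset.
print(sid, fetch_results(conn, sid))
```

The identifier is the only thing that has to live in the Galaxy dataset; everything else stays in the database.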

HTTP Error 503: Service Unavailable
by Erick Antezana 01 Sep '10

Hi,

I tried to update my local instance but got the following error message:

HTTP Error 503: Service Unavailable

Is the server down?

Cheers,
Erick

integrating Galaxy with a relational data warehouse?
by Yury Bukhman 31 Aug '10

Hi,

We are planning to build a data warehouse for a research center that utilizes multiple high-throughput experimental platforms, e.g. plate-based HTS assays, microarrays of several different types, ChIP-seq, RNA-seq. We have been thinking of managing the data in a relational database. Galaxy looks attractive to us for its workflow management and data provenance features, e.g. to keep track of how raw data are analyzed to produce normalized & summarized datasets and/or final sets of statistics such as p values. We wonder how amenable Galaxy would be to integration with a relational data store.

One possible scenario might be to have Galaxy import a dataset from a relational database, run a workflow, then submit the results back to the database with the associated history or a link thereto.

Another possibility is to forgo the relational database altogether and do all our data management within Galaxy.

Any thoughts? We don't have much experience with Galaxy and would appreciate insights from those who do.

Many thanks.
Yury

--
Yury V. Bukhman, Ph.D.
Associate Scientist, Bioinformatics
Great Lakes Bioenergy Research Center
University of Wisconsin - Madison
445 Henry Mall, Rm. 513
Madison, WI 53706, USA
Phone: 608-890-2680
Fax: 608-890-2427
Email: ybukhman(a)glbrc.wisc.edu

Galaxy Genome Builds
by Anton Nekrutenko 31 Aug '10

Dear Galaxy user:

Below are two important points about Galaxy genome builds that may affect your analysis results:

1. Canonical and Complete builds

Many genomes provided by Galaxy originate from UCSC and are represented by a mix of canonical chromosomes (such as 1 through Y in mammals), the mitochondrial genome (chrM), and unplaced fragments (chrUn) and/or haploblocks. Until last week we were using only canonical chromosomes and the mitochondrion to build the indices required by NGS mappers (bwa, bowtie, bfast, lastz, and perm) and SAM Tools. However, for some analyses it is necessary to include regions that are not placed within the canonical chromosome set. Starting next week we will be providing two sets of NGS mapper indices: canonical (chr1 through chrY + chrM) and full (everything). A separate message will be sent out once this is enabled.

2. A recent unannounced change to hg19

The hg19 version of the NGS and SAMTools indices contained only CANONICAL chromosomes (chr1 through chrY + chrM) until August 20. On August 20 it was changed to the full version. As a result, mapping jobs that were run after August 20, 2010 may return slightly different results.

Anton Nekrutenko
http://usegalaxy.org
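
The canonical/full distinction can be illustrated with a short sketch that splits a list of sequence names into the two sets. It assumes UCSC-style human naming (chr1 through chr22, chrX, chrY, chrM); the example names and the helper names `is_canonical` and `split_build` are illustrative and are not how Galaxy builds its indices.

```python
import re

# Assumed UCSC-style human naming: chr1..chr22, chrX, chrY, chrM.
CANONICAL = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def is_canonical(name):
    """True for a canonical chromosome or the mitochondrion."""
    return CANONICAL.match(name) is not None


def split_build(names):
    """Return (canonical, full): the canonical subset and every sequence."""
    canonical = [name for name in names if is_canonical(name)]
    return canonical, list(names)


# Illustrative sequence names: two canonical, one unplaced, one haplotype.
names = ["chr1", "chrM", "chrUn_gl000211", "chr6_apd_hap1"]
canonical, full = split_build(names)
print(canonical)
print(full)
```

Reads that map best to an unplaced or haplotype sequence have somewhere to go only in the full set, which is one way results from the two index sets can differ.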

Re: [galaxy-user] Galaxy developers conference
by Jeremy Goecks 27 Aug '10

(cc'ing galaxy-user to keep track of this issue)

Hi Arthur,

They are not password protected, but there is definitely a problem. This is likely a Bitbucket issue, and we're working on it to see what we can do. We'll get back to you soon.

Thanks,
J.

> Hi Jeremy
>
> I'm unable to download The Galaxy team talks at http://bitbucket.org/galaxy/galaxy-central/wiki/DevConf2010
> bitbucket says:
>> Internal Server Error
>> We're sorry, but something seems to have gone wrong.
>>
>> An email with details about the error has already been sent to us, and we will look into the matter.
>
> Are they password protected?
>
> BR
> A

[Genome] Genome-wide dataset of protein location conversion
by Manuel Leichsenring 27 Aug '10

Hi Dana,

the way I do it is the following:

1. Get a list of all genes (e.g. RefSeq) using the "Get Data" "From UCSC Main Tablebrowser" option.
2. "Get Flanks" of the genes (in your case 1kb or 5kb).
3. Use "Join" from "Operate on genomic intervals" to intersect your peak regions with a) Upstream Flanks b) Downstream Flanks c) the genes themselves.
4. You can then use the "Group" option and "Concatenate queries" to get a single list of genes without duplicates.

Cheers
Manuel

----- Original Message -----

Good Afternoon Dana:

You could produce such a list from a given position with a MySQL operation to the UCSC public MySQL server. Note this Wiki page with a description of the command: http://genomewiki.ucsc.edu/index.php/Finding_nearby_genes

--Hiram

----- Original Message -----
From: "Dana Levasseur" <dana-levasseur(a)uiowa.edu>
To: galaxy-user(a)lists.bx.psu.edu
Sent: Wednesday, August 25, 2010 3:26:28 PM GMT -08:00 US/Canada Pacific
Subject: [galaxy-user] FW: [Genome] Genome-wide dataset of protein location conversion

Hello,

We have been trying to determine a method to convert ChIP-Seq coordinates into a list of genes and are wondering the best way to utilize the Galaxy browser. The UCSC folks suggested you could help but I should have been more specific with my request. Ideally we would like to take global binding coordinates and find out what genes are nearby (i.e. within either 1 kb or 5 kb) instead of simply the “closest feature”. Might you be able to advise on this? I have enclosed the text (.bed format) file I used to get the binding sites in the UCSC genome browser.

Thanks in advance!

Cheers,
Dana
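
The flank-and-join idea above can be sketched outside Galaxy as a plain interval test: widen each gene by the flank size and report every gene whose widened interval overlaps a peak. This is a minimal sketch, not the Galaxy tools themselves; it assumes half-open (start, end) coordinates, the gene and peak values are made up, and the quadratic loop is only suitable for small inputs.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals overlap when each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def genes_near_peaks(peaks, genes, flank=1000):
    """Names of genes whose body, extended by `flank` bp on both sides,
    overlaps at least one peak. Duplicates are removed, as in the last step."""
    nearby = set()
    for chrom, p_start, p_end in peaks:
        for g_chrom, g_start, g_end, name in genes:
            if chrom != g_chrom:
                continue
            window_start = max(0, g_start - flank)
            window_end = g_end + flank
            if overlaps(p_start, p_end, window_start, window_end):
                nearby.add(name)
    return sorted(nearby)


# Made-up example data: (chrom, start, end) peaks and named genes.
peaks = [("chr1", 5000, 5020)]
genes = [
    ("chr1", 5500, 8000, "geneA"),
    ("chr1", 9000, 12000, "geneB"),
    ("chr2", 5000, 6000, "geneC"),
]
print(genes_near_peaks(peaks, genes, flank=1000))
print(genes_near_peaks(peaks, genes, flank=5000))
```

Unlike a closest-feature lookup, this returns every gene inside the window, so one peak can yield several genes or none.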

FW: [Genome] Genome-wide dataset of protein location conversion
by Levasseur, Dana 26 Aug '10

Hello,

We have been trying to determine a method to convert ChIP-Seq coordinates into a list of genes and are wondering the best way to utilize the Galaxy browser. The UCSC folks suggested you could help but I should have been more specific with my request. Ideally we would like to take global binding coordinates and find out what genes are nearby (i.e. within either 1 kb or 5 kb) instead of simply the "closest feature". Might you be able to advise on this? I have enclosed the text (.bed format) file I used to get the binding sites in the UCSC genome browser.

Thanks in advance!

Cheers,
Dana

From: Mary Goldman [mailto:mary@soe.ucsc.edu]
Sent: Thursday, August 12, 2010 2:06 PM
To: Das, Satyabrata
Cc: genome(a)soe.ucsc.edu
Subject: Re: [Genome] Genome-wide dataset of protein location conversion

Hi Satya,

If you are looking for the nearest gene to a genomic coordinate, you should use Galaxy (http://main.g2.bx.psu.edu/). Load your coordinates as a custom track on the UCSC genome browser, go to the table browser and send the output to Galaxy (click the check box after the output option pull down menu). I believe the tool you want is "Fetch closest feature" under the "Operate on Genomic Intervals" menu. If you have questions about Galaxy, please contact their help desk at galaxy-user(a)lists.bx.psu.edu.

I hope this information addresses your question and is helpful. Please feel free to contact the mail list again if you require further assistance.

Best,
Mary
------------------
Mary Goldman
UCSC Bioinformatics Group

On 8/11/10 10:03 PM, Das, Satyabrata wrote:

Hello,

I am analyzing a genome-wide dataset of protein location coordinates in the following format and want to know if they can be converted to the gene that they are most closely located next to:

chr1:3002834-3002851
chr1:4132776-4132783
chr1:4322743-4322748
chr2:155204062-155204080
chr2:155207569-155207570
chr2:155209754-155209758
chr2:155275773-155275774
chr2:155311478-155311484

The coordinates are derived from the mm8, Feb 2006, Build 36 genome build. I would like to know if the UCSC table browser has a batch function that enables conversion of these coordinates into the gene descriptions or gene symbols that the binding sites are positioned near (either within the gene bodies or intergenically).

Thank you,
Satya
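
Coordinates written as `chrom:start-end` need to be turned into tab-separated columns before they can be loaded as a custom track or a Galaxy interval dataset. The sketch below does that conversion. It assumes the positions are 1-based and inclusive, as browser position strings usually are, and shifts the start to BED's 0-based convention; if the coordinates are already 0-based, pass `one_based=False`. The function name `position_to_bed` is illustrative.

```python
import re

# Matches strings such as "chr1:3002834-3002851".
POSITION = re.compile(r"^(chr\w+):(\d+)-(\d+)$")


def position_to_bed(position, one_based=True):
    """Convert 'chrom:start-end' into a tab-separated BED line.

    Assumption: with one_based=True the start is 1-based inclusive and is
    shifted down by one to match BED's 0-based, half-open convention.
    """
    match = POSITION.match(position.strip())
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    chrom = match.group(1)
    start = int(match.group(2))
    end = int(match.group(3))
    if one_based:
        start -= 1
    return f"{chrom}\t{start}\t{end}"


for line in ["chr1:3002834-3002851", "chr2:155207569-155207570"]:
    print(position_to_bed(line))
```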

finding file format of data item
by Kelly Vincent 26 Aug '10

Begin forwarded message:

> From: galaxy-user-bounces(a)lists.bx.psu.edu
> Date: August 25, 2010 5:44:32 PM EDT
> To: galaxy-user-owner(a)lists.bx.psu.edu
> Subject: Auto-discard notification
>
> The attached message has been automatically discarded.

> From: "Belinda M. Giardine" <giardine(a)bx.psu.edu>
> Date: August 25, 2010 5:44:28 PM EDT
> To: Galaxy-user(a)bx.psu.edu
> Subject: finding file format of data item
>
> I have a tool that can be run on multiple data formats. But I need
> to run it differently depending on the format of the input dataset.
> How can I get the input data format?
>
> In the tools .xml file I have:
>
> <param format="interval,lped" name="input1" type="data" label="SNPs">
>
> and
>
> #if $input1.metadata.format=="interval" ...
>
> The if doesn't work, what should be there?
>
> Thanks,
> Belinda
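
A minimal sketch, in Python, of running a tool differently per input format once the format name is known. How the name reaches the script is left open here: Galaxy tool templates commonly expose a dataset's datatype extension as `$input1.ext`, but nothing in this thread confirms that, so treat it as an assumption. The handler functions and the path are hypothetical placeholders.

```python
def run_interval(path):
    """Hypothetical handler for interval-format input."""
    return f"interval handler: {path}"


def run_lped(path):
    """Hypothetical handler for lped-format input."""
    return f"lped handler: {path}"


# Map each accepted format name to the code path that handles it.
HANDLERS = {"interval": run_interval, "lped": run_lped}


def dispatch(fmt, path):
    """Pick the handler for `fmt`, failing loudly on unknown formats."""
    try:
        handler = HANDLERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported input format: {fmt!r}") from None
    return handler(path)


# The format string would be passed in by the tool wrapper (an assumption).
print(dispatch("interval", "snps.dat"))
print(dispatch("lped", "snps.dat"))
```

Keeping the branch in the script rather than in the template means the accepted formats are listed in one place and an unexpected format produces a clear error.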

Data storage
by Fanny Coffin 26 Aug '10

Hi,

I'm evaluating whether we can use Galaxy in our production environment for NGS data, and I have a question about data storage.

NGS produces huge files, which we store on our servers in a specific folder organisation. To use them in Galaxy, these files have to be uploaded (so that the database is filled in with information such as the first lines, the fields...). But I'm wondering whether these files necessarily have to be imported into the Galaxy workspace, or whether they can just be linked. I ask because we absolutely want to avoid data duplication.

Could you please enlighten me about that?

Thanks in advance.

Cordially,
Fanny COFFIN
