Python error when running Bowtie for Illumina
by Weng Khong Lim
Hi all,
I'm new to next-gen sequencing, so please be gentle. I've just received a
pair of Illumina FASTQ files from the sequencing facility and intend to map
them to the hg19 reference genome. I first used the FASTQ Groomer utility to
convert the reads into Sanger reads. However, when running Bowtie for
Illumina on the resulting dataset under default settings, I received the
following error:
An error occurred running this job: Error aligning sequence. requested
number of bytes is more than a Python string can hold
Can someone help point out my mistake? My history is accessible at
http://main.g2.bx.psu.edu/u/wengkhong_lim/h/chip-seq-pilot-batch
Appreciate the help!
Weng Khong, LIM
Department of Genetics
University of Cambridge
E-mail: wkl24(a)cam.ac.uk
Tel: +447503225832
12 years
Re: [galaxy-user] integrating Galaxy with a relational data warehouse?
by Yury Bukhman
Thank you, James, for your reply. I wonder if you could elaborate on why storing the bulk of the data in a relational database seems impractical, or point me to a document where this is discussed at more length.
Yury
On 08/31/10, James Taylor <james(a)jamestaylor.org> wrote:
> Hi Yury,
>
> > We are planning to build a data warehouse for a research center that utilizes multiple high-throughput experimental platforms, e.g. plate-based HTS assays, microarrays of several different types, ChIP-seq, RNA-seq. We have been thinking of managing the data in a relational database. Galaxy looks attractive to us for its workflow management and data provenance features, e.g. to keep track of how raw data are analyzed to produce normalized & summarized datasets and/or final sets of statistics such as p values. We wonder how amenable Galaxy would be to integration with a relational data store.
> >
> > One possible scenario might be to have Galaxy import a dataset from a relational database, run a workflow, then submit the results back to the database with the associated history or link thereto.
>
> This is certainly a reasonable possibility. You could have a Galaxy tool for submitting data to your database. I would imagine such a tool would produce a Galaxy dataset as output with whatever unique identifier is necessary to recover exactly that data from the database for another analysis.
>
> > Another possibility is to forgo the relational database altogether and do all our data management within Galaxy.
>
>
> I can only give you our experience from inside Galaxy. After initial analysis we made a decision to store all data in Galaxy as files on disk, with metadata (data about data, connections between datasets, workflows, et cetera) in a relational database. We feel this decision has worked well. For the scale of data we see, as well as the wide variety of different data types, a relational database did not, and still does not, seem practical to us.
>
> -- jt
>
> James Taylor
> Assistant Professor
> Department of Biology
> Department of Mathematics & Computer Science
> Emory University
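
As a rough illustration of the layout James describes (bulk data as files on disk, with only metadata and dataset connections in a relational database), here is a minimal sketch. It is not Galaxy's actual schema or code: the `dataset` table, its columns, and the helper names `open_store`, `register_dataset`, and `load_dataset` are all hypothetical, and SQLite stands in for whatever database a site would really use. The returned ID plays the role of the unique identifier needed to recover exactly that data for another analysis.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_store(db_path=":memory:"):
    """Open the metadata database (hypothetical schema, not Galaxy's)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS dataset (
               id INTEGER PRIMARY KEY,
               name TEXT NOT NULL,
               datatype TEXT NOT NULL,
               path TEXT,
               parent_id INTEGER REFERENCES dataset(id)
           )"""
    )
    return conn


def register_dataset(conn, data_dir, name, datatype, content, parent_id=None):
    """Write the bulk data to a file; store only metadata in the database."""
    cur = conn.execute(
        "INSERT INTO dataset (name, datatype, parent_id) VALUES (?, ?, ?)",
        (name, datatype, parent_id),
    )
    dataset_id = cur.lastrowid
    path = Path(data_dir) / f"dataset_{dataset_id}.dat"
    path.write_text(content)
    conn.execute(
        "UPDATE dataset SET path = ? WHERE id = ?", (str(path), dataset_id)
    )
    conn.commit()
    return dataset_id


def load_dataset(conn, dataset_id):
    """Recover the file contents for a dataset from its unique ID."""
    row = conn.execute(
        "SELECT path FROM dataset WHERE id = ?", (dataset_id,)
    ).fetchone()
    if row is None:
        raise KeyError(dataset_id)
    return Path(row[0]).read_text()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_store()
        raw = register_dataset(conn, tmp, "raw reads", "fastq", "@r1\nACGT\n+\nIIII\n")
        # parent_id records that the summary was derived from the raw dataset.
        summary = register_dataset(conn, tmp, "summary", "tabular", "reads\t1\n", raw)
        print(load_dataset(conn, summary))
        conn.close()
```

The `parent_id` column is only a stand-in for provenance tracking; a real system would record the tool and parameters that connect one dataset to another.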
--
Yury V. Bukhman, Ph.D.
Associate Scientist, Bioinformatics
Great Lakes Bioenergy Research Center
University of Wisconsin - Madison
445 Henry Mall, Rm. 513
Madison, WI 53706, USA
Phone: 608-890-2680 Fax: 608-890-2427
Email: ybukhman(a)glbrc.wisc.edu
12 years, 5 months
HTTP Error 503: Service Unavailable
by Erick Antezana
Hi,
I've tried to update my local instance but I got the following error message:
HTTP Error 503: Service Unavailable
Is the server down?
cheers,
Erick
12 years, 5 months
integrating Galaxy with a relational data warehouse?
by Yury Bukhman
Hi,
We are planning to build a data warehouse for a research center that utilizes multiple high-throughput experimental platforms, e.g. plate-based HTS assays, microarrays of several different types, ChIP-seq, RNA-seq. We have been thinking of managing the data in a relational database. Galaxy looks attractive to us for its workflow management and data provenance features, e.g. to keep track of how raw data are analyzed to produce normalized & summarized datasets and/or final sets of statistics such as p values. We wonder how amenable Galaxy would be to integration with a relational data store.
One possible scenario might be to have Galaxy import a dataset from a relational database, run a workflow, then submit the results back to the database with the associated history or link thereto.
Another possibility is to forgo the relational database altogether and do all our data management within Galaxy.
Any thoughts? We don't have much experience with Galaxy and would appreciate insights from those who do.
Many thanks.
Yury
--
Yury V. Bukhman, Ph.D.
Associate Scientist, Bioinformatics
Great Lakes Bioenergy Research Center
University of Wisconsin - Madison
445 Henry Mall, Rm. 513
Madison, WI 53706, USA
Phone: 608-890-2680 Fax: 608-890-2427
Email: ybukhman(a)glbrc.wisc.edu
12 years, 5 months
Galaxy Genome Builds
by Anton Nekrutenko
Dear Galaxy user:
Below are two important points about Galaxy genome builds that may affect your analysis results:
1. Canonical and Complete builds
Many genomes provided by Galaxy originate from UCSC and are represented by a mix of canonical chromosomes (such as 1 through Y in mammals), the mitochondrial genome (chrM), and unplaced fragments (chrUn) and/or haploblocks. Until last week we were using only canonical chromosomes and the mitochondrion to build indices required by NGS mappers (bwa, bowtie, bfast, lastz, and perm) and SAMtools. However, for some analyses it is necessary to include regions that are not placed within the canonical chromosome set. Starting next week we will be providing two sets of NGS mapper indices: canonical (chr1 through chrY + chrM) and full (everything). A separate message will be sent out once this is enabled.
2. A recent unannounced change to hg19
The hg19 version of the NGS and SAMtools indices contained only CANONICAL chromosomes (chr1 through chrY + chrM) until August 20. On August 20 it was changed to the full version. As a result, mapping jobs that were run after August 20, 2010 may return slightly different results.
Anton Nekrutenko
http://usegalaxy.org
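
To check which set a given result was mapped against, one option is to look at the reference sequence names it contains. The following is a minimal sketch, assuming UCSC-style mammalian names in which canonical sequences are `chr` followed by a number, `X`, `Y`, or `M`. The pattern and the helper names `is_canonical` and `split_build` are illustrative and not part of Galaxy.

```python
import re

# Assumption: canonical sequences are named chr<number>, chrX, chrY or chrM.
# Anything else (chrUn_*, *_random, haplotype blocks) counts as "extra".
CANONICAL = re.compile(r"chr(\d+|X|Y|M)")


def is_canonical(name):
    """Return True if a sequence name belongs to the canonical set."""
    return CANONICAL.fullmatch(name) is not None


def split_build(names):
    """Split sequence names into (canonical, extra), preserving order."""
    canonical = [n for n in names if is_canonical(n)]
    extra = [n for n in names if not is_canonical(n)]
    return canonical, extra


if __name__ == "__main__":
    names = ["chr1", "chrX", "chrM", "chrUn_gl000211", "chr6_apd_hap1"]
    canonical, extra = split_build(names)
    print(canonical)  # ['chr1', 'chrX', 'chrM']
    print(extra)      # ['chrUn_gl000211', 'chr6_apd_hap1']
```

If `extra` is non-empty for the reference names in a mapping result, that result was produced with a full index rather than a canonical one. Genomes with other naming schemes would need a different pattern.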
12 years, 5 months
Re: [galaxy-user] Galaxy developers conference
by Jeremy Goecks
(cc'ing galaxy-user to keep track of this issue)
Hi Arthur,
They are not password protected, but there is definitely a problem. This is likely a Bitbucket issue, and we're working on it to see what we can do. We'll get back to you soon.
Thanks,
J.
> Hi Jeremy
>
> I'm unable to download the Galaxy team talks at http://bitbucket.org/galaxy/galaxy-central/wiki/DevConf2010
> bitbucket says:
>> Internal Server Error
>> We're sorry, but something seems to have gone wrong.
>>
>> An email with details about the error has already been sent to us, and we will look into the matter.
>>
>
> Are they password protected?
>
> BR
> A
12 years, 5 months
[Genome] Genome-wide dataset of protein location conversion
by Manuel Leichsenring
Hi Dana,
The way I do it is the following:
Get a list of all genes (e.g. RefSeq) using the "Get Data" "From UCSC Main Tablebrowser" option.
"Get Flanks" of the genes (in your case 1kb or 5kb).
Use "Join" from "Operate on genomic intervals" to intersect your peak regions with a) Upstream Flanks b) Downstream Flanks c) the genes themselves.
You can then use the "Group" option and "Concatenate queries" to get a single list of genes without duplicates.
Cheers
Manuel
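
For readers who prefer to script it, the logic of the steps above (extend each gene by a flank, intersect with the peaks, collapse to a unique gene list) can be sketched in plain Python. This is only an illustration of the idea, not what the Galaxy tools run internally. The `Interval` type, the function names, and the example genes and peaks are made up. Coordinates are assumed to be 0-based, half-open as in BED, and strand is ignored, so upstream and downstream flanks are treated alike.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive
    name: str


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def genes_near_peaks(peaks, genes, flank):
    """Return sorted, de-duplicated names of genes with a peak in the gene
    body or within `flank` bases on either side."""
    hits = set()
    for gene in genes:
        # Gene body plus both flanks, clipped at the chromosome start.
        window = Interval(
            gene.chrom, max(0, gene.start - flank), gene.end + flank, gene.name
        )
        if any(overlaps(window, peak) for peak in peaks):
            hits.add(gene.name)
    return sorted(hits)


if __name__ == "__main__":
    genes = [
        Interval("chr1", 10000, 12000, "GeneA"),
        Interval("chr1", 50000, 52000, "GeneB"),
        Interval("chr2", 10000, 12000, "GeneC"),
    ]
    peaks = [
        Interval("chr1", 9500, 9600, "peak1"),
        Interval("chr1", 47500, 47600, "peak2"),
    ]
    print(genes_near_peaks(peaks, genes, 1000))  # ['GeneA']
    print(genes_near_peaks(peaks, genes, 5000))  # ['GeneA', 'GeneB']
```

The nested loop is quadratic and fine for a quick check; for genome-wide peak sets, the interval tools in Galaxy or a sorted sweep would be the better choice.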
----- Original Message -----
Good Afternoon Dana:
You could produce such a list from a given position with a MySQL operation
to the UCSC public MySQL server. Note this Wiki page with a description
of the command:
http://genomewiki.ucsc.edu/index.php/Finding_nearby_genes
--Hiram
----- Original Message -----
From: "Dana Levasseur" <dana-levasseur(a)uiowa.edu>
To: galaxy-user(a)lists.bx.psu.edu
Sent: Wednesday, August 25, 2010 3:26:28 PM GMT -08:00 US/Canada Pacific
Subject: [galaxy-user] FW: [Genome] Genome-wide dataset of protein location conversion
Hello,
We have been trying to determine a method to convert ChIP-Seq coordinates into a list of genes and are wondering about the best way to utilize the Galaxy browser. The UCSC folks suggested you could help but I should have been more specific with my request. Ideally we would like to take global binding coordinates and find out what genes are nearby (i.e. within either 1 or 5 kb) instead of simply the “closest feature”. Might you be able to advise on this? I have enclosed the text (.bed format) file I used to get the binding sites in the UCSC genome browser. Thanks in advance!
Cheers,
Dana
12 years, 5 months
FW: [Genome] Genome-wide dataset of protein location conversion
by Levasseur, Dana
Hello,
We have been trying to determine a method to convert ChIP-Seq
coordinates into a list of genes and are wondering about the best way to
utilize the Galaxy browser. The UCSC folks suggested you could help but
I should have been more specific with my request. Ideally we would like
to take global binding coordinates and find out what genes are nearby
(i.e. within either 1 or 5 kb) instead of simply the "closest feature". Might
you be able to advise on this? I have enclosed the text (.bed format)
file I used to get the binding sites in the UCSC genome browser. Thanks
in advance!
Cheers,
Dana
From: Mary Goldman [mailto:mary@soe.ucsc.edu]
Sent: Thursday, August 12, 2010 2:06 PM
To: Das, Satyabrata
Cc: genome(a)soe.ucsc.edu
Subject: Re: [Genome] Genome-wide dataset of protein location conversion
Hi Satya,
If you are looking for the nearest gene to a genomic coordinate, you
should use Galaxy (http://main.g2.bx.psu.edu/). Load your coordinates
as a custom track on the UCSC genome browser, go to the table browser
and send the output to Galaxy (click the check box after the output
option pull down menu). I believe the tool you want is "Fetch closest
feature" under the "Operate on Genomic Intervals" menu. If you have
questions about Galaxy, please contact their help desk at
galaxy-user(a)lists.bx.psu.edu.
I hope this information addresses your question and is helpful. Please
feel free to contact the mail list again if you require further
assistance.
Best,
Mary
------------------
Mary Goldman
UCSC Bioinformatics Group
On 8/11/10 10:03 PM, Das, Satyabrata wrote:
Hello,
I am analyzing a genome-wide dataset of protein location coordinates in
the following format and want to know if they can be converted to the
gene that they are most closely located next to:
chr1:3002834-3002851
chr1:4132776-4132783
chr1:4322743-4322748
chr2:155204062-155204080
chr2:155207569-155207570
chr2:155209754-155209758
chr2:155275773-155275774
chr2:155311478-155311484
The coordinates are derived from the mm8, Feb 2006, Build 36 genome
build. I would like to know if the UCSC table browser has a batch
function that enables conversion of these coordinates into the gene
descriptions or gene symbols that the binding sites are positioned near
(either within the gene bodies or intergenically).
Thank you,
Satya
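
Coordinates written as `chr1:3002834-3002851` need to be turned into tab-separated BED lines before they can be loaded as a custom track or uploaded to Galaxy. Below is a minimal sketch of that conversion. It assumes the positions are 1-based and inclusive, as in UCSC browser position strings, while BED uses a 0-based start and an exclusive end; if the coordinates above are already 0-based, the subtraction should be dropped. The function name `position_to_bed` is illustrative.

```python
import re

# Matches strings such as "chr1:3002834-3002851".
POSITION = re.compile(r"(chr\w+):(\d+)-(\d+)")


def position_to_bed(position):
    """Convert 'chrom:start-end' into a tab-separated BED line.

    Assumption: the input is 1-based and inclusive, so the start is
    shifted down by one to give BED's 0-based, half-open interval.
    """
    match = POSITION.fullmatch(position.strip().replace(",", ""))
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {position!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start < 1 or start > end:
        raise ValueError(f"invalid coordinates: {position!r}")
    return f"{chrom}\t{start - 1}\t{end}"


if __name__ == "__main__":
    for pos in ["chr1:3002834-3002851", "chr2:155207569-155207570"]:
        print(position_to_bed(pos))
```

Writing one such line per binding site gives a file that can be uploaded as BED with the genome build set to mm8.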
12 years, 5 months
finding file format of data item
by Kelly Vincent
Begin forwarded message:
> From: galaxy-user-bounces(a)lists.bx.psu.edu
> Date: August 25, 2010 5:44:32 PM EDT
> To: galaxy-user-owner(a)lists.bx.psu.edu
> Subject: Auto-discard notification
>
> The attached message has been automatically discarded.
> From: "Belinda M. Giardine" <giardine(a)bx.psu.edu>
> Date: August 25, 2010 5:44:28 PM EDT
> To: Galaxy-user(a)bx.psu.edu
> Subject: finding file format of data item
>
>
> I have a tool that can be run on multiple data formats. But I need
> to run it differently depending on the format of the input dataset.
> How can I get the input data format?
>
> In the tool's .xml file I have:
>
> <param format="interval,lped" name="input1" type="data" label="SNPs">
>
> and
>
> #if $input1.metadata.format=="interval" ...
>
> The #if doesn't work. What should be there?
>
> Thanks,
> Belinda
>
>
>
12 years, 5 months
Data storage
by Fanny Coffin
Hi,
I'm trying to evaluate the possibility of using Galaxy in our production
environment for NGS data.
I have a question about data storage. NGS produces huge files
that we store on our servers in a specific folder organisation. With
Galaxy, these files have to be uploaded (in order to fill the
database with information such as the first lines, the fields...). But I'm
wondering whether these files necessarily have to be imported into the
Galaxy workspace or whether they can just be linked. My question comes
from the fact that we absolutely want to avoid data duplication.
Could you please enlighten me about that?
Thanks in advance.
Cordially.
Fanny COFFIN
12 years, 5 months