June 2012 - galaxy-user - lists.galaxyproject.org

Re: [galaxy-user] (no subject)
by Jennifer Jackson 04 Jun '12

Hi Megan,

I ran a few tests and found that changing the file suffix to .txt when using the "autodetect" upload type sped up the loading process considerably. Because the final result is a Galaxy dataset identical to the one produced with the existing suffix, I would recommend trying this next time.

For my test, I took one of your files and changed the suffix directly. No other changes were made to the content, as it was already a tab-delimited text file. I did not continue testing to specify the datatype at upload (tabular would be the correct choice). That change may also speed up import slightly, although the .txt suffix change alone was dramatic and the upload was quick (I ran a side-by-side comparison of an original file and a file modified to have the .txt suffix).

The general reason behind this is that Galaxy interprets data to detect and confirm datatypes during upload, in order to create the associated metadata needed for tool use. Detection is a convenience option that comes at a cost (compute resources and time). If you can provide this information instead, the detection portion of the process can be skipped, confirmation and metadata creation can start directly, and the result is a quicker upload.

Hopefully this helps for next time,

Jen
Galaxy team

On 5/22/12 8:49 AM, Estorninho, Megan wrote:
> Yes I am still experiencing problems. My files are only around 80-120MB and are taking hours to load, if at all.
> Thanks for your help,
> Megan
>
> Sent from my iPhone
>
> On 22 May 2012, at 14:38, "Jennifer Jackson" <jen(a)bx.psu.edu> wrote:
>
>> Hello Megan,
>>
>> Are you still experiencing problems now? Galaxy may have been busy
>> immediately following the resolution of the cluster problem, although
>> your problem does appear to be unrelated.
>>
>> It sounds like you are uploading files through a browser. A better choice
>> would be to use FTP. This is required for datasets approaching or
>> exceeding 2G in size.
>>
>> Files that are < 2G, really any file over ~ 500MB, can also benefit from
>> FTP upload. An FTP client tracks the progress of an upload and can
>> resume an interrupted transfer. http://wiki.g2.bx.psu.edu/FTPUpload
>>
>> Hopefully this helps,
>>
>> Jen
>> Galaxy team
>>
>> On 5/21/12 10:21 AM, Estorninho, Megan wrote:
>>> I have been unable to upload data files into Galaxy Main since Friday 18th May 2012. Today is my fourth day of attempting uploads. Refreshing and leaving the files to upload overnight does not work.
>>> Although Jennifer has stated the bug has been fixed at 5.30pm today I am still unable to upload data files. I thought I may be exceeding maximum file capacity but I am well below at only 1.8Gb.
>>> ___________________________________________________________
>>> The Galaxy User list should be used for the discussion of
>>> Galaxy analysis and other features on the public server
>>> at usegalaxy.org. Please keep all replies on the list by
>>> using "reply all" in your mail client. For discussion of
>>> local Galaxy instances and the Galaxy source code, please
>>> use the Galaxy Development list:
>>>
>>> http://lists.bx.psu.edu/listinfo/galaxy-dev
>>>
>>> To manage your subscriptions to this and other Galaxy lists,
>>> please use the interface at:
>>>
>>> http://lists.bx.psu.edu/
>> --
>> Jennifer Jackson
>> http://galaxyproject.org
>>

--
Jennifer Jackson
http://galaxyproject.org
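The suggestion at the top of this thread (give a tab-delimited file a .txt suffix before uploading with "autodetect") can be scripted when many files are involved. The following is a minimal sketch and not part of the original exchange: the function names `looks_tab_delimited` and `copy_with_txt_suffix`, and the idea of sampling the first lines for tabs, are illustrative assumptions.

```python
import shutil
from pathlib import Path


def looks_tab_delimited(path, sample_lines=20):
    """Return True if every non-empty sampled line contains a tab."""
    seen = 0
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            if "\t" not in line:
                return False
            seen += 1
            if seen >= sample_lines:
                break
    return seen > 0


def copy_with_txt_suffix(path):
    """Copy `path` next to itself with a .txt suffix; the content is unchanged."""
    source = Path(path)
    if not looks_tab_delimited(source):
        raise ValueError(f"{source} does not look tab-delimited")
    target = source.with_suffix(".txt")
    if target == source:
        return source  # already has the .txt suffix
    if target.exists():
        raise FileExistsError(target)
    shutil.copyfile(source, target)
    return target
```

Copying rather than renaming keeps the original file available for a side-by-side comparison like the one described in the message.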

GCC2012 Early Registration ends in 1 week
by Dave Clements 04 Jun '12

Hello all,

Just a reminder that early registration for the 2012 Galaxy Community Conference (GCC2012) <http://galaxyproject.org/wiki/Events/GCC2012>, being held in Chicago, Illinois, July 25-27, *closes on June 11, one week from today*. Registering early saves 36 to 42% on registration costs, and allows you to book discounted conference lodging <http://wiki.g2.bx.psu.edu/Events/GCC2012/Logistics#Lodging> *before it fills up*. Register now <http://wiki.g2.bx.psu.edu/Events/GCC2012/Register>.

GCC2012 <http://galaxyproject.org/wiki/Events/GCC2012> is about integrating, analyzing, and sharing the diverse and very large datasets that are now typical in biomedical research. This is an opportunity to share best practices with, and learn from, a large community of researchers and support staff who are facing the challenges of data-intensive biology. Galaxy <http://gmod.org/wiki/Galaxy> is an open web-based platform for data intensive biomedical research <http://galaxyproject.org> that is widely used and deployed at research organizations of all sizes and around the world.

The GCC2012 Training Day <http://galaxyproject.org/wiki/Events/GCC2012/TrainingDay> agenda has been finalized. It has 3 parallel tracks, each featuring four 90-minute workshops, and covers 10 different topics. The final schedule <http://wiki.g2.bx.psu.edu/Events/GCC2012/Program> of speakers <http://wiki.g2.bx.psu.edu/Events/GCC2012/Program#Confirmed_Speakers> and abstracts <http://wiki.g2.bx.psu.edu/Events/GCC2012/Abstracts> are also now available.

Hope to see you in Chicago!

Dave Clements, on behalf of the GCC2012 Organizing Committee <http://galaxyproject.org/wiki/Events/GCC2012/Organizing%20Committee>

--
http://galaxyproject.org/GCC2012 <http://galaxyproject.org/wiki/GCC2012>
http://galaxyproject.org/
http://getgalaxy.org/
http://usegalaxy.org/
http://galaxyproject.org/wiki/

email database
by Roger Liu 04 Jun '12

Hi,

I'm a new user installing my own Galaxy, and I am developing some tools. I now face a challenge: I need to get the email address of the currently logged-in user from within my tools, but I didn't find any API that can do this. So I was wondering whether there is a database that stores the email address. I need to identify the logged-in user's email and return it to my tool.

Thanks

ChIP-seq data analysis question
by cjt5＠buffalo.edu 02 Jun '12

Hello,

My name is Christopher Terranova and I am an M.S. student at the University of Buffalo SUNY. I have been attempting to analyze my MACS data using Galaxy. I already have my custom peaks on the UCSC Genome Browser and have some specific questions.

I am attempting to show how my peaks (and peak center coordinates) relate to gene units (+/- TSS and genic) and to intergenic regions specifically. I have been attempting to do this in two different ways and am not sure if I am doing it correctly. Below I list the steps I have been using, with particular questions highlighted near my problem. I would also like to apologize for this extended e-mail; however, I have only been working with Galaxy for approximately a month, and figuring out all the manipulations is rather difficult. If someone can answer my questions I would greatly appreciate it!

These questions relate specifically to promoters.

Retrieving TSS coordinates

1. Go to the UCSC Genome Browser, click "Tables" at the top of the page, and select mouse mm9 as the organism.
2. Select "RefSeq genes" in tracks, BED as the "output format", and check "Send output to galaxy".
3. Click "Get output" then "Send output to galaxy"; you are redirected to your Galaxy account, which contains an additional dataset.
4. Use the Galaxy "Filter" tool (left column) to select all "+" strand genes.
5. Use the "Cut" tool (left column) to extract columns 1,2,2,4,5,6 (**is the c2 column repeated twice??**) in order to build a BED file containing the TSS for all "+" strand genes.
6. Do the same for the genes on the "-" strand.

Computing peak center coordinates

1. In Galaxy, select the tool "Compute expression on every row" in the left column (Text manipulation section).
2. As an expression, select c2+(c3-c2+1)/2, round result "YES".
3. Select the dataset containing the peaks for one of the TFs (HNF4a or CBPA), and click "Execute"; this creates a new dataset with an additional column containing the coordinate of the peak center.
4. Now select the tool "Cut", and extract the columns c1,c6,c6,c4,c5 (**is the c6 column repeated twice??**) to create a new BED file containing the peak center.
5. Edit the metadata of this new dataset (by clicking on the small pencil icon), and change the format to BED.

Computing distance to closest TSS

1. Select the tool "Fetch closest non-overlapping feature", then select the new dataset containing the peak center coordinates and the dataset containing the mouse TSS. A new dataset is created containing, for each peak, the closest TSS.
2. Compute the distance from the peak center to the closest TSS using the "Compute expression on every row" tool (**what expression should I use to do this**).
3. Plot the distribution using the "Histogram of a numeric column" tool.

Secondary way

I understand this is not identifying the peak center closest to the TSS or a particular strand; however, I still have a couple of questions.

Now we have a data set corresponding to all human RefSeqs (34,765) and we want to convert this set into one corresponding to human promoter regions. First, we will make sure our data set contains just the start and end coordinates of the genes. Select the "Text Manipulation" tool and then "Cut" columns from a table. Set "cut columns" to "c1,c2,c3,c4,c6" (**Is this the right c1... conformation??**). Make sure our previously downloaded RefSeq data set is selected and click on "Execute".

When this is finished, click on the pencil icon to assign names to the columns. Set name to "RefSeqs", click "save", change the data type to "interval", and click "save". Now click the pencil icon again to define the columns. Set the start column to "2", the end column to "3", the strand column to "5", and the "Name/Identifier" column to "4", and click on "save".

Now go to the "Operate on Genomic Intervals" section of the "Tools" menu and select "Get flanks" to get the flanking regions for the RefSeq data set we just created. Make sure our RefSeq data set is selected and that we get the "upstream" flanking regions for this data set. Set the length of the flanking region to 1000 to get the coordinates for 1kb upstream. Later on we could use different intervals. Click on "Execute".

When this has finished, go to "Operate on Genomic Intervals" again and select "Join". Now set "First query" to "Get flanks.." and "Second query" to the peaks file of the "MACS" output, and then click on "Execute". We now end up with 710 regions where our ChIP-Seq peaks overlap with our 1kb upstream region (promoter region).

Lastly, while not discussed here, what exactly does the offset command do when getting flanks?

Thank you very much and again, I apologize for the extensive questions!

Sincerely,
Christopher Terranova
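The arithmetic in the promoter workflow can be written out explicitly, which may make it easier to check each Galaxy step. This is a minimal sketch, not a description of what the Galaxy tools do internally. It assumes BED-style records where the TSS of a "+" strand gene is its start column and the TSS of a "-" strand gene is its end column, and it uses one common sign convention (negative means upstream of the TSS). The function names are hypothetical. Repeating a column in "Cut" (for example c1,c2,c2,...) simply writes the same coordinate as both start and end, which is what `tss_position` stands in for here.

```python
def tss_position(start, end, strand):
    """Transcription start site of a gene record: start for '+', end for '-'."""
    if strand == "+":
        return start
    if strand == "-":
        return end
    raise ValueError(f"unexpected strand: {strand!r}")


def peak_center(start, end):
    """Same arithmetic as the expression c2+(c3-c2+1)/2 with rounding."""
    return round(start + (end - start + 1) / 2)


def signed_distance_to_tss(center, tss, strand):
    """Distance from peak center to TSS, oriented by the gene's strand.

    Negative values are upstream of the TSS, positive values downstream
    (an assumed convention, not one stated in the message).
    """
    if strand == "+":
        return center - tss
    if strand == "-":
        return tss - center
    raise ValueError(f"unexpected strand: {strand!r}")


# Example: a peak spanning 4750-4851 near a '+' strand gene starting at 5000.
center = peak_center(4750, 4851)
distance = signed_distance_to_tss(center, tss_position(5000, 9000, "+"), "+")
```

If strand is ignored, the unsigned distance is simply the absolute difference between the peak-center column and the TSS column.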

Search Galaxy Problems-solutions archive
by Jennifer Jackson 01 Jun '12

Hello Yanming,

Prior mailing list Q/A archive searching:
http://wiki.g2.bx.psu.edu/Mailing%20Lists#Searching

Wiki resources:
http://wiki.g2.bx.psu.edu/Support (also has mailing list search link & other custom google search links)
http://wiki.g2.bx.psu.edu/Learn
http://wiki.g2.bx.psu.edu/FrontPage

Best,

Jen
Galaxy team

On 6/1/12 12:38 PM, Yang, Yanming wrote:
> Hi Jennifer,
>
> Is there an archive for questions-answers or problems-solutions back-forth emails for Galaxy, so that when I have problems/issues I can search the archive (database) first to see if they were encountered and already fixed? If there is, would you please show me the link?
>
> Thanks!
>
> Yanming
>
> ------------
> Yanming Yang, Ph.D.
> Translational Sciences Lab
> Florida State University, College of Medicine
> 1115 W Call Street, MSR 1350-M
> Tallahassee, FL 32306
> Office: 850-645-0019

--
Jennifer Jackson
http://galaxyproject.org

Cufflinks error with Illumina iGenome .GTF for annotation
by Sarah Elisabeth Ewald 01 Jun '12

Hi all,

I am attempting to use tophat > cufflinks > cuffmerge > cuffdiff to compare transcript expression in 3 samples (no replicates, Illumina single-end reads). Using the built-in UCSC mm9 reference genome I can complete the analysis just fine, with the caveat that there is no annotation. When I repeat the analysis using the Illumina iGenome UCSC mm9 .gtf annotation file, I get the following error in Cufflinks:

    An error occurred running this job: cufflinks v1.3.0
    cufflinks -q --no-update-check -I 300000 -F 0.100000 -j 0.150000 -p 8 -G /galaxy/main_pool/pool5/files/004/309/dataset_4309547.dat -N
    Error running cufflinks. return code = -11
    cufflinks: /lib64/libz.so.1: no version information available

I have set the identifier/build as "Mouse July 2007 (NCBI37/mm9) (mm9)", so that does not seem to be the problem. Suggestions as to how to fix this problem OR add annotations to the already completed analysis would be terrific.

Thanks!
Sarah
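One detail in the error text can be decoded mechanically. If the return code was reported using the convention of Python's `subprocess` module (an assumption; the message does not say how the code was obtained), a negative value -N means the process was terminated by signal N, and signal 11 is SIGSEGV on Linux. Under that reading, cufflinks crashed rather than exiting with an ordinary error status. The sketch below only illustrates the decoding; `describe_return_code` is a hypothetical helper and not part of Galaxy or Cufflinks.

```python
import signal


def describe_return_code(code):
    """Describe a return code that follows Python's subprocess convention."""
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {-code} ({name})"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_return_code(-11))
```

The accompanying libz line ("no version information available") is commonly only a dynamic-linker warning, so it may be unrelated to the crash, although that cannot be confirmed from the message alone.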

June 2012 Galaxy Update
by Dave Clements 01 Jun '12

galaxy cloud not setting up properly
by Randall, Thomas (NIH/NIEHS) [C] 31 May '12

The last few times I have tried to initiate a galaxy instance on the cloud I have gotten messages like the following:

* 18:42:04 - Master starting
* 18:42:05 - Completed initial cluster configuration.
* 18:42:09 - Prerequisites OK; starting service 'SGE'
* 18:42:20 - Configuring SGE...
* 18:42:29 - Successfully setup SGE; configuring SGE
* 18:42:29 - Saved file 'persistent_data.yaml' to bucket 'cm-26cac39701f0918ab9a9dca54f69e925'
* 18:42:29 - Saved file 'cm_boot.py' to bucket 'cm-26cac39701f0918ab9a9dca54f69e925'
* 18:42:29 - Problem connecting to bucket 'cm-26cac39701f0918ab9a9dca54f69e925', attempt 1/5
* 18:42:32 - Saved file 'cm.tar.gz' to bucket 'cm-26cac39701f0918ab9a9dca54f69e925'
* 18:42:32 - Saved file 'test.clusterName' to bucket 'cm-26cac39701f0918ab9a9dca54f69e925'
* 18:44:34 - Initializing a 'Galaxy' cluster.
* 18:44:34 - Retrieved file 'snaps.yaml' from bucket 'cloudman' to 'cm_snaps.yaml'.
* 18:45:25 - Error mounting file system '/mnt/galaxyData' from '/dev/sdg3', running command '/bin/mount /dev/sdg3 /mnt/galaxyData' returned code '32' and following stderr: 'mount: you must specify the filesystem type '
* 18:45:27 - Prerequisites OK; starting service 'Postgres'
* 18:45:27 - PostgreSQL data directory '/mnt/galaxyData/pgsql/data' does not exist (yet?)
* 18:45:27 - Configuring PostgreSQL with a database for Galaxy...
* 18:45:39 - Prerequisites OK; starting service 'Galaxy'
* 18:45:39 - Setting up Galaxy application
* 18:45:40 - Retrieved file 'universe_wsgi.ini.cloud' from bucket 'cloudman' to '/mnt/galaxyTools/galaxy-central/universe_wsgi.ini'.
* 18:45:40 - Retrieved file 'tool_conf.xml.cloud' from bucket 'cloudman' to '/mnt/galaxyTools/galaxy-central/tool_conf.xml'.
* 18:45:40 - Retrieved file 'tool_data_table_conf.xml.cloud' from bucket 'cloudman' to '/mnt/galaxyTools/galaxy-central/tool_data_table_conf.xml.cloud'.
* 18:45:40 - Starting Galaxy...
* 18:45:51 - Saved file 'persistent_data.yaml' to bucket 'cm-26cac39701f0918ab9a9dca54f69e925'
* 18:49:34 - Galaxy daemon not running.
* 18:49:34 - Galaxy service state changed from 'Starting' to 'Error'
* 18:49:35 - Saved file 'persistent_data.yaml' to bucket 'cm-26cac39701f0918ab9a9dca54f69e925'
* 18:49:41 - Galaxy daemon not running.
* 18:49:58 - Galaxy daemon not running.
* 18:50:15 - Galaxy daemon not running.

I am using 861460482541/galaxy-cloudman-2011-03-22, which is supposed to be the current version.

Tom

Thomas Randall, PhD
Bioinformatics Scientist, Contractor
Integrative Bioinformatics
National Institute of Environmental Health Sciences
P.O. Box 12233, Research Triangle Park, NC 27709
randallta2(a)niehs.nih.gov
919-541-2271
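In a long startup log of this kind, the first failure is usually the most informative line. Here the mount error at 18:45:25 precedes the PostgreSQL and Galaxy messages, both of which refer to paths under /mnt/galaxyData, which suggests the later failures may follow from it. The sketch below pulls such lines out of log text of the form `* HH:MM:SS - message`. It is an illustration only: the keyword list and the function name `problem_lines` are assumptions and not part of CloudMan.

```python
import re

# One log entry: optional leading '*', a timestamp, ' - ', then the message.
LOG_LINE = re.compile(r"^\*?\s*(\d{2}:\d{2}:\d{2}) - (.*)$")

# Hypothetical keywords that mark a line as worth reading first.
PROBLEM_WORDS = ("error", "problem", "not running", "does not exist")


def problem_lines(log_text):
    """Return (timestamp, message) pairs for suspicious lines, in order.

    Repeated messages are reported once, at their first occurrence.
    """
    seen = set()
    found = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if not match:
            continue
        timestamp, message = match.groups()
        lowered = message.lower()
        if any(word in lowered for word in PROBLEM_WORDS) and message not in seen:
            seen.add(message)
            found.append((timestamp, message))
    return found
```

Reporting each distinct message once keeps the repeated "Galaxy daemon not running." entries from burying the earlier mount error.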

Question about fetching sequence from genome
by Qianli Shen 31 May '12

Hi,

I want to fetch sequences from the soybean genome according to a GFF file. My GFF3 file and genome file are attached to this email, because the format is not easy to recognize if I paste it into the email. The job keeps reporting this error:

    An error occurred running this job: Traceback (most recent call last):
      File "/galaxy/home/g2main/galaxy_main/tools/extract/extract_genomic_dna.py", line 288, in <module>
        if __name__ == "__main__": __main__()
      File "/galaxy/home/g2main/galaxy_main/tools/extract/extract_genomic_dna.py"

Could you please tell me where the problem is?

Best,
Qianli
