Re: [galaxy-user] Experience with Loading NGS data on standalone instance of galaxy

4 Nov 2009

      Hi!

One more question regarding change set 2812:

Add a new option, 'allow_library_path_paste' that adds a new upload page
("Upload files from file system paths") to the admin-side library upload 
pages.

I am not sure where to add this option. I tried each of the following in universe_wsgi.ini:

allow_library_path_paste = True
allow_library_path_paste = /somepath/

Neither worked. How can I enable the "allow_library_path_paste" option?

thanks!

mat

Greg Von Kuster wrote:
...
Change set 2812 will be included in a release to the distribution today 
- here are details of a new option that we're hoping will provide what 
is needed for most labs.
Add a new option, 'allow_library_path_paste' that adds a new upload page
("Upload files from file system paths") to the admin-side library upload 
pages.
This form contains a textarea that allows Galaxy admins to paste any number of
file system paths (files or directories) from which Galaxy will import library
datasets, preserving the directory structure (if desired).  Since this ability
gives admins access to any file on the Galaxy server that is readable by
Galaxy's system user, this option is disabled by default, and system
administrators should take care in assigning Galaxy administrators when this
feature is enabled.  Controls on which files are accessible to this tool, based
on ownership or other properties, can be added at a later date if there is
sufficient interest in such features.
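
As a rough illustration of what "paste any number of paths, preserving the directory structure" could mean in practice, here is a minimal Python sketch. It is not Galaxy's code: the function name `expand_pasted_paths`, the `preserve_dirs` flag, and the (file, folder) return shape are all assumptions made up for this example.

```python
import os


def expand_pasted_paths(pasted_text, preserve_dirs=True):
    """Turn pasted paths into (source_file, library_folder) pairs.

    Hypothetical helper for illustration only, not Galaxy's implementation.
    """
    results = []
    for line in pasted_text.splitlines():
        path = line.strip()
        if not path:
            continue  # skip blank lines in the textarea
        path = os.path.abspath(path)
        if os.path.isfile(path):
            # A single file goes to the top level of the library.
            results.append((path, ""))
        elif os.path.isdir(path):
            base = os.path.basename(path)
            for dirpath, dirnames, filenames in os.walk(path):
                dirnames.sort()  # deterministic traversal order
                rel = os.path.relpath(dirpath, path)
                if preserve_dirs:
                    folder = base if rel == "." else os.path.join(base, rel)
                else:
                    folder = ""  # flatten everything into one folder
                for name in sorted(filenames):
                    results.append((os.path.join(dirpath, name), folder))
    return results
```

Paths that are neither files nor directories are silently skipped here; a real form would presumably report them back to the admin.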
This commit also includes a checkbox on the "Upload directory of files" page
(as well as the new "Upload files from file system paths" page above) that will
prevent Galaxy from copying data to its files directory (by default,
'database/files/').  This is useful for large library datasets that live in
their own managed locations on the file system, since it prevents duplicate
copies of datasets.  It does mean that administrators must take care to
manage the data: moving or removing the data from its Galaxy-external location
will render these datasets invalid within Galaxy.
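
Because uncopied datasets become invalid as soon as the external file moves, an admin may want a periodic check. The following Python sketch assumes you already have a list of the referenced paths (how you obtain it from Galaxy is not covered here); `find_invalid_links` is a hypothetical name, not a Galaxy function.

```python
import os


def find_invalid_links(referenced_paths):
    """Return the referenced paths that are no longer readable regular files."""
    return [
        p for p in referenced_paths
        if not (os.path.isfile(p) and os.access(p, os.R_OK))
    ]
```

The check should be run as Galaxy's system user, since readability depends on who is asking.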
One unique feature to be aware of: when using the "Copy data into Galaxy?"
checkbox on the "Upload directory of files" page, any symbolic links
encountered in the chosen import directory will be made absolute and
dereferenced ONCE.  This allows administrators to link large datasets to the
import directory, rather than having to make full copies, while being able to
delete such links after importing.  Only the first symlink (the one in the
import directory itself) is dereferenced; all others remain.  See the following
for an example:
library_import_dir = /galaxy/import
% ls -lR /galaxy/import
/galaxy/import:
total 6
drwxr-xr-x   2 nate     nate         512 Oct  1 11:31 link/
/galaxy/import/link:
total 10
lrwxrwxrwx   1 nate     nate          71 Oct  1 10:38 1.bed -> ../../../home/nate/galaxy/test-data/1.bed
lrwxrwxrwx   1 nate     nate          60 Oct  1 10:38 2.bed -> /home/nate/galaxy/test-data/2.bed
lrwxrwxrwx   1 nate     nate          11 Oct  1 10:38 3.bed -> ../../3.bed
lrwxrwxrwx   1 nate     nate          35 Oct  1 11:30 4.bed -> ../../galaxy_symlink/test-data/4.bed
lrwxrwxrwx   1 nate     nate          41 Oct  1 11:31 5.bed -> /galaxy/galaxy_symlink/test-data/5.bed
% ls -l /galaxy/3.bed
lrwxrwxrwx   1 nate     nate          60 Oct  1 10:39 /galaxy/3.bed -> /home/nate/galaxy/test-data/3.bed
% ls -l /galaxy/galaxy_symlink
lrwxrwxrwx   1 nate     nate          44 Oct  1 11:30 /galaxy/galaxy_symlink -> /home/nate/galaxy/
In this example,
1.bed is a relative symbolic link to the real 1.bed.
2.bed is an absolute symlink to the real 2.bed.
3.bed is a relative symlink to ../../3.bed, aka /galaxy/3.bed, which itself is
a symlink to the real 3.bed.
4.bed is a relative symlink which follows another symlink
(/galaxy/galaxy_symlink) to the real 4.bed.
5.bed is an absolute symlink in the same fashion as 4.bed.
If the 'link' server directory is chosen on the "Upload directory of files"
page, and "Copy data into Galaxy?" is checked "No", the following files will be
referenced by Galaxy:
/home/nate/galaxy/test-data/1.bed
/home/nate/galaxy/test-data/2.bed
/galaxy/3.bed
/galaxy/galaxy_symlink/test-data/4.bed
/galaxy/galaxy_symlink/test-data/5.bed
The Galaxy administrator may now safely delete /galaxy/import/link, but should
take care not to remove the referenced symbolic links (/galaxy/3.bed,
/galaxy/galaxy_symlink).
Not all symbolic links are dereferenced because it is assumed that if an
administrator links to a path in the import directory which itself is (or
contains) links, that is the preferred path for accessing the data.
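
The "made absolute and dereferenced ONCE" rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the code from the change set: `dereference_once` is a hypothetical name, and the sketch collapses ".." components lexically with `os.path.normpath`, which matches the example above only because /galaxy/import and /galaxy/import/link are real directories rather than symlinks.

```python
import os


def dereference_once(path):
    """Resolve only the symlink at `path` itself; leave later links alone.

    Hypothetical helper for illustration only.
    """
    if not os.path.islink(path):
        return os.path.abspath(path)
    target = os.readlink(path)
    if not os.path.isabs(target):
        # A relative target is relative to the directory holding the link.
        target = os.path.join(os.path.dirname(os.path.abspath(path)), target)
    # normpath tidies "a/../b" lexically and does NOT follow further symlinks,
    # unlike os.path.realpath, which would resolve the whole chain.
    return os.path.normpath(target)
```

Applied to /galaxy/import/link/3.bed in the listing above, this yields /galaxy/3.bed (still a symlink), whereas `os.path.realpath` would go all the way to /home/nate/galaxy/test-data/3.bed.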
Oliver Hofmann wrote:
...
Dear all,
to echo what Abhi said: we are also currently looking for ways to 
automatically import data sets (libraries) into Galaxy without having to 
manually trigger the import via the administration interface, and 
ideally while keeping the data in the original place. The idea here is 
to have multiple tools all point at the original 'source data' without 
having to replicate terabytes of data.
Not quite sure how feasible this is in practice, but it certainly would 
be incredibly helpful.
Best,
Oliver
On 28 Sep 2009, at 14:24, Abhishek Pratap wrote:
...
Hi Greg
Thanks for a quick reply and for making some of the requested changes. However,
I am still not sure that importing NGS data will help in the long run.
For centers generating NGS data, which could be 2-3 TB per week depending
on the number of sequencers, I think importing another copy of the raw data
into the galaxy workspace will require a lot of disk space. I understand it
is a neat way of doing things, as it becomes agnostic of the raw data
location, but it might not be the best way of handling huge data in the long
run for centers like ours.
Please correct me if I am wrong. I think we could also have a simple 
option without having to import the data and just using it for 
analysis from the current location, also storing results at the same 
location. That way in future even if the data set is moved analysis 
also stays with it.
Let me know what you feel. I will be happy to know if there are any 
other smart reasons for importing the data into the galaxy workspace, just 
for curiosity's sake.
Thanks,
-Abhi
On Mon, Sep 28, 2009 at 9:28 AM, Greg Von Kuster <ghv2@psu.edu> wrote:
Hello Abhishek,
The Galaxy distribution includes the enhancements to which I 
previously referred for uploading history files.  Uploading files to a 
history now creates a Galaxy job just like any other tool, and can be 
run on a cluster node, allowing upload of very large files.  The 
initial pass of this work is also completed for uploading to a Data 
Library, but this enhancement is still in test, so it should soon be 
available in the distribution.
Do you want to avoid having to import at all (e.g. allow Galaxy to 
refer to datasets that live in their original locations)?  This is not 
currently possible, but if this is what you are looking for, we can 
consider some additional options on the current upload form, or 
possibly a new, separate form.
Greg Von Kuster
Galaxy Development Team
Abhishek Pratap wrote:
Hi Greg, Anton and all
Just wondering if there has been any progress made on this end. I am 
sorry I was not able to follow up on Assaf's suggestion due to 
other things at work.
I did try the latest version of galaxy and it looks like the files are 
still transferred over HTTP before they can be used in the galaxy 
workspace. Also I would again like to highlight that many labs might 
want to use the local instance of galaxy and prefer to point to a 
local path where the file is being stored. That way we will have both 
the benefits of using a cool GUI and process data stored locally.
Let me know if you guys need some feedback or have more questions. I 
will be happy to discuss them.
best,
-Abhi
On Tue, Jul 21, 2009 at 4:26 PM, Greg Von Kuster <ghv2@psu.edu> wrote:
Hello Abhishek,
We are currently in the process of significantly enhancing the
current Galaxy upload utilities, and the new version should
eliminate the issue you've raised about the time needed to upload
large files via HTTP (though not the need to make an initial copy of
the file in the Galaxy environment). However, it will probably not be
ready for release for a few more weeks, so if you can take advantage of
Assaf's script in the meantime, that's great. I can't guarantee
that all Galaxy features will function correctly if you do this, though.
Assaf, have you found that using your script breaks anything?
Also, if you upload a file to a library rather than a history,
multiple users can "import" the library dataset into their history
for analysis, but there is only 1 file on disk (users are pointing
to it from their histories). But uploading a file to a history
will create a new copy of the file each time it is uploaded.
Greg Von Kuster
Galaxy Development Team
Abhishek Pratap wrote:
Hi All
@Greg : Please find my comments below.
On Tue, Jul 21, 2009 at 10:44 AM, Greg Von Kuster <ghv2@psu.edu> wrote:
Hello Abhi,
Can you clarify the steps you took that produced the behavior? See my
comments below.
Anton Nekrutenko wrote:
Abhishek:
Let's talk. This is an area of active current development. We are looking
at implementing a universal fastq-like format or supporting multiple
formats. Perhaps we should join efforts in ironing out specifications.
anton
galaxy team
On Jul 20, 2009, at 5:18 PM, Abhishek Pratap wrote:
Hi All
I recently came to know about NGS analysis on galaxy during ISMB.
Getting excited, I tried a couple of things, basically to play with it.
A few comments: I may have interpreted something described below in the
wrong way. My apologies beforehand.
On a standalone installation of galaxy I was trying to explore one
FASTQ (sequence) file. It takes considerable time (> 20 min) for a
fastq file (2 GB) to get uploaded.
Are you using the Galaxy upload utility to create an item in your history
that points to the dataset file on disk?
Yes, that is precisely correct. I am trying to upload a Solexa FASTQ file,
but on a standalone galaxy installation, from my local file system.
I am not sure what the rationale behind that is. Ideally I think there
should be no need to upload such heavy files into the workspace.
A data file that originates from a place external to Galaxy must be uploaded
into Galaxy so that the disk file can be placed in the location configured
in the Galaxy config file. Also, when data is uploaded to Galaxy (either
to a history or a library), several database table settings are created
that are used by various Galaxy features.
Thanks for the clarification, but I am not sure this will help a lot of
people who are interested in installing and running galaxy locally, mainly
for the following reasons. Maybe it is just local to me.
A. We already have one instance of the data saved on the local file system.
B. Making another copy via galaxy will eat away a lot of space in the long run.
C. The time needed to import the files into galaxy space is huge.
They could actually be used straight away by the path specified.
What do you mean by "the path specified"?
Well, what I meant was a way to specify the path of the file/run on the
local file system, so that galaxy could pick it up directly from there
rather than uploading it into its own space. Now I understand this
might not work based on the way the system was designed.
Also, is there any way to access the scripts for analysis on the command
line? I know this undermines the main aim of working with galaxy, but right
now I am concerned about the performance/time.
You should be able to run any Galaxy tool from the command line as long as
you have all of the tool's required binaries in your path. However, running
a tool from within Galaxy should generally not be any slower than running it
outside of Galaxy, depending, of course, on what you are doing.
OK, I was under the impression that running from the shell would eliminate
the step of uploading them into galaxy file space.
-Abhi
I will be happy to discuss more about this in case you have some
comments/questions for me.
Best,
-Abhi
-----------------------------
Abhishek Pratap
Bioinformatics Software Engineer
Institute for Genome Sciences
School of Medicine, Univ of Maryland
801, W. Baltimore Street, Baltimore, MD 21209
Ph: (+1)-410-706-2296
www.igs.umaryland.edu/
_______________________________________________
galaxy-user mailing list
galaxy-user@bx.psu.edu
http://mail.bx.psu.edu/cgi-bin/mailman/listinfo/galaxy-user
Anton Nekrutenko
http://nekrut.bx.psu.edu
http://galaxyproject.org
-- 
Research Associate    Department of Biostatistics
Associate Director    Bioinformatics Core
                      Harvard School of Public Health
Skype: ohofmann       Phone: +1 (617) 365 0984
-- 
------------------------------------------------
Matthias Dodt

Scientific Programmer at Bioinformatics platform AG Dieterich

Berlin Institute for Medical Systems Biology at the
Max-Delbrueck-Center for Molecular Medicine
Robert-Roessle-Strasse 10, 13125 Berlin, Germany

fon: +49 30 9406 4261
email: matthias.dodt@mdc-berlin.de