Re: [galaxy-dev] [galaxy-user] Experience with Loading NGS data on standalone instance of galaxy

21 Jul 2009

Dear Assaf,

Thanks a lot for sharing your trick. I am currently travelling, but I will
surely get back to you on this next week when I return.

Best,
-Abhi

On Tue, Jul 21, 2009 at 12:30 PM, Assaf Gordon<gordon@cshl.edu> wrote:
Hello,
If I may suggest a workaround for getting big files fast into galaxy:
For a local galaxy server, and local files (on the same server as the
running galaxy), there is indeed no need to upload the files.
What we did is to create a simple tool, which accepts a local path, and
copies the file from the local path into the galaxy database.
Local file copy is much quicker than uploading the file through HTTP.
We need a full file copy because the source files are routinely deleted,
but if the source files are kept 'forever', you can modify the tool to
create a soft/hard link to the local file - that would be almost
instantaneous.
Here's an example of such a tool:
### ( scifi.cshl.edu is mounted as SMB, so for all practical purposes it
### behaves like local files )
###
### The XML file: import_scifi_file.xml
###
<tool id="cshl_import_scifi_file" name="Import SciFi file">
       <command interpreter="bash">import_scifi_file.sh '$filepath' $output</command>
       <inputs>
               <param name="filepath" type="text"
                      size="100" label="File path (on \\Scifi.cshl.edu\hannon )" />
       </inputs>
       <outputs>
               <data format="txt" name="output"
                     label="Scifi Import: $filepath" />
       </outputs>
</tool>
###
### The shell script: import_scifi_file.sh
###
### The script basically copies the source file (param 1)
### to the destination file (param 2, which is galaxy's dataset_NNNNN.dat).
###
### The extra code tries to make it as safe as possible, allowing imports only
### from /media/scifi
###
### The script can be changed from copying to linking - would be even faster.
###
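#!/bin/bash
# Needs bash rather than plain sh: it uses pushd/popd, ${#VAR},
# ${VAR//pattern/replacement} and ${VAR:offset:length}.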
SCIFI_BASE_DIR="/media/scifi"
BASE_DIR_LEN=${#SCIFI_BASE_DIR}
INPUT="$1"
OUTPUT="$2"
if [ -z "$OUTPUT" ]; then
       echo "Usage: $0 [INPUT] [OUTPUT]" >&2
       exit 1
fi
if [ ! -d "${SCIFI_BASE_DIR}" ]; then
       echo "Internal Error: Scifi is not mounted on '${SCIFI_BASE_DIR}'" >&2
       exit 1
fi
FULLPATH="$INPUT"
# Convert backslashes (possibly pasted from windows machines) into forward
# slashes
FULLPATH=${FULLPATH//\\/\/}
# Remove server prefix (possibly pasted by the user on a windows machine)
FULLPATH=${FULLPATH/\/\/scifi\.cshl\.edu\/hannon\//}
FULLPATH=${FULLPATH/\/\/scifi\/hannon\//}
#Construct full path with "/media/scifi" prefix
FULLPATH="${SCIFI_BASE_DIR}/${FULLPATH}"
# Safety check -
#  change to the directory of the requested file.
#  It should begin with "/media/scifi".
#  If it doesn't, it means somebody tried to pull a trick by using
#  a bad mixture of "../../../../.." in the file path.
DIRECTORY=$(dirname "$FULLPATH")
pushd "$DIRECTORY" > /dev/null
REALDIR=$(pwd)
popd > /dev/null
#extract the prefix from the 'real' directory path
REAL_DIR_PREFIX=${REALDIR:0:$BASE_DIR_LEN}
#DEBUG
#echo "FULLPATH = $FULLPATH"
#echo "DIRECTORY= $DIRECTORY"
#echo "REALDIR  = $REALDIR"
#echo "REAL_DIR_PREFIX = $REAL_DIR_PREFIX"
#echo "BASE_DIR = $SCIFI_BASE_DIR"
# Probably foul play:
#  the real path of the requested input file does not start with the prefix
#  of '/media/scifi' - maybe somebody's trying to get a file outside 'scifi'??
if [ "$REAL_DIR_PREFIX" != "$SCIFI_BASE_DIR" ]; then
       echo "Error: invalid input file ($INPUT)" >&2
       exit 1
fi
# If we got here, the $FULLPATH is at least in a valid location under the
# '/media/scifi' directory.
if [ ! -r "$FULLPATH" ]; then
       echo "Error: input file ($INPUT) is not a valid file." >&2
       exit 1
fi
cp "$FULLPATH" "$OUTPUT"
if [ $? != 0  ]; then
       echo "Error: failed to copy \"$INPUT\"!" >&2
       exit 1
fi
echo "File \"$INPUT\" Imported."
exit 0
###
###
###
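For the linking variant mentioned above, here is a minimal sketch of what could replace the `cp` step. `import_by_link` is a hypothetical helper name, not part of the original script. It assumes the source path is absolute (as `$FULLPATH` is in the script), that the source file is never deleted, and that Galaxy may already have created the output dataset file, so the old file is removed first. Whether a linked dataset suits your Galaxy setup is worth checking on a test instance.

```shell
#!/bin/sh
# Sketch only: link the source file into place instead of copying it.
# import_by_link is a hypothetical helper; SRC must be an absolute path,
# otherwise the symbolic link would point to the wrong place.
import_by_link() {
    src="$1"
    dest="$2"
    # Assumption: the destination may already exist (e.g. as an empty
    # dataset file), so remove it to let the link take its place.
    rm -f "$dest"
    # Try a hard link first (works only within one filesystem),
    # then a symbolic link, and fall back to a full copy.
    ln "$src" "$dest" 2>/dev/null ||
        ln -s "$src" "$dest" 2>/dev/null ||
        cp "$src" "$dest"
}
```

In the script above, the line `cp "$FULLPATH" "$OUTPUT"` would then become `import_by_link "$FULLPATH" "$OUTPUT"`, with the same error check after it.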
A further improvement is not to allow a free-text file path, but instead to
use dynamic options to select from a list of files, like so (in the XML file):
       <param name="localfile" type="select" label="Solexa Data File">
           <options from_file="cshl_import_files_hannon.txt">
               <column name="name" index="1"/>
               <column name="value" index="0"/>
           </options>
       </param>
Then have a cron job generate 'cshl_import_files_hannon.txt', listing the
files which can be uploaded.
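A rough sketch of what such a cron job could run is below. `build_import_list` is a hypothetical name, and the tab-separated layout (column 0 = full path used as the value, column 1 = path relative to the base directory used as the display name) is an assumption chosen to match the `index="0"` / `index="1"` columns in the XML above. File names containing tabs or newlines are not handled.

```shell
#!/bin/sh
# Sketch only: write a two-column, tab-separated list of importable files.
# build_import_list BASE_DIR LIST_FILE   (hypothetical helper)
build_import_list() {
    base_dir="$1"
    list_file="$2"
    tmp_file="${list_file}.tmp.$$"
    # Column 0: full path (the option value).
    # Column 1: path relative to BASE_DIR (the name shown to the user).
    find "$base_dir" -type f | sort | while IFS= read -r path; do
        printf '%s\t%s\n' "$path" "${path#"$base_dir"/}"
    done > "$tmp_file" &&
        # Replace the list in one step so readers never see a partial file.
        mv "$tmp_file" "$list_file"
}

# Example call (paths are placeholders, adjust to your installation):
# build_import_list /media/scifi /path/to/cshl_import_files_hannon.txt
```

Saved as a script with the example call uncommented, it could be run from cron every few minutes. With full paths as option values, the import script would no longer need to prepend "/media/scifi", but the safety check on the resolved directory is still worth keeping.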
Hope this helps,
 -gordon.
Greg Von Kuster wrote, On 07/21/2009 10:44 AM:
Hello Abhi,
Can you clarify the steps you took that produced the behavior?  See my
comments below.
Anton Nekrutenko wrote:
Abhishek:
Let's talk. This is an area of active development. We are looking
at implementing a universal fastq-like format or supporting multiple
formats. Perhaps we should join efforts in ironing out specifications.
anton
galaxy team
On Jul 20, 2009, at 5:18 PM, Abhishek Pratap wrote:
Hi All
I recently came to know about NGS analysis on galaxy during ISMB.
Excited, I tried a couple of things, basically to play with it.
A few comments: I may have interpreted something described below the
wrong way. My apologies beforehand.
On a standalone installation of galaxy I was trying to explore
one FASTQ (sequence) file. It takes considerable time (> 20 min) for a
2 GB fastq file to get uploaded.
Are you using the Galaxy upload utility to create an item in your history
that points to the dataset file on disk?
I am not sure what is the rationale behind that. Ideally I think there
should be no need to upload such heavy files into the workspace.
A data file that originates from a place external to Galaxy must be
uploaded into Galaxy so that the disk file can be placed in the location
configured in the Galaxy config file.  Also, when data is uploaded to Galaxy
( either to a history or a library ), several database table settings are
created that are used by various Galaxy features.
They could actually be used straight away by the path specified.
What do you mean by "the path specified"?
Also, is there any way to access the scripts for analysis on the command
line? I know this undermines the main aim of working with galaxy, but
right now I am concerned about the performance/time.
You should be able to run any Galaxy tool from the command line as long as
you have all of the tool's required binaries in your path.  However, running
a tool from within Galaxy should generally not be any slower than running it
outside of Galaxy, depending, of course, on what you are doing.
I will be happy to discuss more about this in case you have some
comments/questions for me.
Best,
-Abhi
-----------------------------
Abhishek Pratap
Bioinformatics Software Engineer
Institute for Genome Sciences
School of Medicine, Univ of Maryland
801, W. Baltimore Street, Baltimore, MD 21209
Ph: (+1)-410-706-2296
www.igs.umaryland.edu/
_______________________________________________
galaxy-user mailing list
galaxy-user@bx.psu.edu
http://mail.bx.psu.edu/cgi-bin/mailman/listinfo/galaxy-user
Anton Nekrutenko
http://nekrut.bx.psu.edu
http://galaxyproject.org