Dear Assaf,

Thanks a lot for sharing your trick. I am currently travelling, but I will get back to you on this next week when I am back.

Best,
-Abhi

On Tue, Jul 21, 2009 at 12:30 PM, Assaf Gordon <gordon@cshl.edu> wrote:
Hello,
If I may suggest a workaround for getting big files fast into galaxy:
For a local galaxy server, and local files (on the same server as the running galaxy), there is indeed no need to upload the files.
What we did is to create a simple tool, which accepts a local path, and copies the file from the local path into the galaxy database. Local file copy is much quicker than uploading the file through HTTP.
We need a full file copy because the source files are routinely deleted, but if the source files are kept 'forever', you can modify the tool to create a soft/hard link to the local file - that would be almost instantaneous.
Here's an example of such a tool:
### ( scifi.cshl.edu is mounted as SMB, so for all practical purposes it behaves like local files )
###
### The XML file: import_scifi_file.xml
###
<tool id="cshl_import_scifi_file" name="Import SciFi file">
  <command interpreter="sh">import_scifi_file.sh '$filepath' $output</command>

  <inputs>
    <param name="filepath" type="text" size="100" label="File path (on \\Scifi.cshl.edu\hannon )" />
  </inputs>

  <outputs>
    <data format="txt" name="output" label="Scifi Import: $filepath" />
  </outputs>
</tool>
###
### The shell script: import_scifi_file.sh
###
### The script basically copies the source file (param 1)
### to the destination file (param 2, which is galaxy's dataset_NNNNN.dat).
###
### The extra code tries to make it as safe as possible, allowing imports only
### from /media/scifi
###
### The script can be changed from copying to linking - would be even faster.
###
#!/bin/bash
# Note: bash rather than plain sh, because the script relies on bash-only
# features (${var//a/b}, ${var:0:n}, pushd/popd). If it is invoked through
# "sh", that sh must be bash.

SCIFI_BASE_DIR="/media/scifi"
BASE_DIR_LEN=${#SCIFI_BASE_DIR}

INPUT="$1"
OUTPUT="$2"

if [ -z "$OUTPUT" ]; then
    echo "Usage: $0 [INPUT] [OUTPUT]" >&2
    exit 1
fi

if [ ! -d "${SCIFI_BASE_DIR}" ]; then
    echo "Internal Error: Scifi is not mounted on '${SCIFI_BASE_DIR}'" >&2
    exit 1
fi

FULLPATH="$INPUT"

# Convert backslashes (possibly pasted from windows machines) into forward slashes
FULLPATH=${FULLPATH//\\/\/}

# Remove server prefix (possibly pasted by the user on a windows machine)
FULLPATH=${FULLPATH/\/\/scifi\.cshl\.edu\/hannon\//}
FULLPATH=${FULLPATH/\/\/scifi\/hannon\//}

# Construct full path with "/media/scifi" prefix
FULLPATH="${SCIFI_BASE_DIR}/${FULLPATH}"

# Safety check -
# change to the directory of the requested file.
# It should begin with "/media/scifi".
# If it doesn't, it means somebody tried to pull a trick by using
# a bad mixture of "../../../../.." in the file path.
DIRECTORY=$(dirname "$FULLPATH")

pushd "$DIRECTORY" > /dev/null
REALDIR=$(pwd)
popd > /dev/null

# Extract the prefix from the 'real' directory path
REAL_DIR_PREFIX=${REALDIR:0:$BASE_DIR_LEN}

#DEBUG
#echo "FULLPATH = $FULLPATH"
#echo "DIRECTORY= $DIRECTORY"
#echo "REALDIR = $REALDIR"
#echo "REAL_DIR_PREFIX = $REAL_DIR_PREFIX"
#echo "BASE_DIR = $SCIFI_BASE_DIR"

# Probably foul play:
# the real path of the requested input file does not start with the prefix
# of '/media/scifi' - maybe somebody's trying to get a file outside 'scifi' ??
if [ "$REAL_DIR_PREFIX" != "$SCIFI_BASE_DIR" ]; then
    echo "Error: invalid input file ($INPUT)" >&2
    exit 1
fi

# If we got here, the $FULLPATH is at least in a valid location under the '/media/scifi' directory.
if [ ! -r "$FULLPATH" ]; then
    echo "Error: input file ($INPUT) is not a valid file." >&2
    exit 1
fi

cp "$FULLPATH" "$OUTPUT"
if [ $? != 0 ]; then
    echo "Error: failed to copy \"$INPUT\"!" >&2
    exit 1
fi

echo "File \"$INPUT\" Imported."

exit 0
### ### ###
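One possible tightening of the safety check, as a rough sketch only and not part of the script above: a plain prefix comparison also accepts a sibling directory such as "/media/scifi2", and a logical "pwd" may not reflect symlinks. The helper below resolves both directories physically and compares them with a trailing slash. The function name is_under_base is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the original script):
# succeeds only if the directory that contains "$2" resolves to the base
# directory "$1" itself or to a directory below it.
is_under_base() {
    local base="$1" target="$2" real_base real_dir

    # Resolve both directories physically (-P), so that symlinks and
    # "../.." sequences are followed before the comparison.
    real_base=$(cd -P -- "$base" 2>/dev/null && pwd -P) || return 1
    real_dir=$(cd -P -- "$(dirname -- "$target")" 2>/dev/null && pwd -P) || return 1

    # Compare with a trailing slash, so that "/media/scifi2" is not
    # mistaken for something under "/media/scifi".
    case "$real_dir/" in
        "$real_base"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use inside the import script (names taken from the script above):
#   is_under_base "$SCIFI_BASE_DIR" "$FULLPATH" || { echo "Error: invalid input file" >&2; exit 1; }
```

This only validates the containing directory, as the original check does; if the file itself may be a symlink pointing elsewhere, that would need a separate test.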
A further improvement is not to allow a free-text file path, but instead to use dynamic options to select from a list of files, like so (in the XML file):
<param name="localfile" type="select" label="Solexa Data File">
  <options from_file="cshl_import_files_hannon.txt">
    <column name="name" index="1"/>
    <column name="value" index="0"/>
  </options>
</param>
And then have a cron job to create 'cshl_import_files_hannon.txt' with files which can be uploaded.
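A minimal sketch of what such a cron job could run, under some assumptions: the list is tab-separated, column 0 holds the full path (the option value) and column 1 the file name (the label shown to the user), matching the column indexes in the XML above. The function name build_import_list, the optional file-name pattern, and the example paths are hypothetical, and the directory where Galaxy looks for the from_file list should be checked against your own installation.

```shell
#!/bin/sh
# Hypothetical list generator (an assumption, not the original cron job).
# Writes one line per file found under the source directory:
#   <full path> TAB <file name>
build_import_list() {
    src_dir="$1"          # directory to scan, e.g. /media/scifi
    out_file="$2"         # list file read by <options from_file="...">
    pattern="${3:-*}"     # optional file-name pattern; defaults to all files
    tmp_file="$out_file.tmp.$$"

    find "$src_dir" -type f -name "$pattern" | sort | while IFS= read -r path; do
        printf '%s\t%s\n' "$path" "$(basename "$path")"
    done > "$tmp_file"

    # Rename into place so that a reader never sees a half-written list.
    mv "$tmp_file" "$out_file"
}

# Example call (both paths are placeholders):
#   build_import_list /media/scifi /path/to/cshl_import_files_hannon.txt
```

File names containing tabs or newlines would break the one-line-per-file layout, so this sketch assumes ordinary file names.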
Hope this helps,
-gordon.
Greg Von Kuster wrote, On 07/21/2009 10:44 AM:
Hello Abhi,
Can you clarify the steps you took that produced the behavior? See my comments below.
Anton Nekrutenko wrote:
Abhishek:
Let's talk. This is an area of active development. We are looking at implementing a universal fastq-like format or supporting multiple formats. Perhaps we should join efforts in ironing out specifications.
anton galaxy team
On Jul 20, 2009, at 5:18 PM, Abhishek Pratap wrote:
Hi All
I recently came to know about NGS analysis on Galaxy during ISMB. Getting excited, I tried a couple of things, basically to play with it.

A few comments: I may have interpreted something described below in the wrong way. My apologies beforehand.

On a standalone installation of Galaxy, I was trying to explore one FASTQ (sequence) file. It takes considerable time (> 20 min) for a FASTQ file (2 GB) to get uploaded.
Are you using the Galaxy upload utility to create an item in your history that points to the dataset file on disk?
I am not sure what the rationale behind that is. Ideally, I think there should be no need to upload such heavy files into the workspace.
A data file that originates from a place external to Galaxy must be uploaded into Galaxy so that the disk file can be placed in the location configured in the Galaxy config file. Also, when data is uploaded to Galaxy ( either to a history or a library ), several database table settings are created that are used by various Galaxy features.
They could actually be used straight away by the path specified.
What do you mean by "the path specified"?
Also, is there any way to access the scripts for analysis on the command line? I know this undermines the main aim of working with Galaxy, but right now I am concerned about the performance/time.
You should be able to run any Galaxy tool from the command line as long as you have all of the tool's required binaries in your path. However, running a tool from within Galaxy should generally not be any slower than running it outside of Galaxy, depending, of course, on what you are doing.
I will be happy to discuss more about this in case you have some comments/questions for me.
Best, -Abhi
-----------------------------
Abhishek Pratap
Bioinformatics Software Engineer
Institute for Genome Sciences
School of Medicine, Univ of Maryland
801, W. Baltimore Street, Baltimore, MD 21209
Ph: (+1)-410-706-2296
www.igs.umaryland.edu/
_______________________________________________
galaxy-user mailing list
galaxy-user@bx.psu.edu
http://mail.bx.psu.edu/cgi-bin/mailman/listinfo/galaxy-user
Anton Nekrutenko http://nekrut.bx.psu.edu http://galaxyproject.org