Problems with DataImport
Hi there,

Here are three interrelated issues. I am trying to use Galaxy with some large cancer genomic datasets here at UCSC and do some systems biology. I have petabyte-size data libraries which will constantly be in flux at the edges. I would prefer to have Galaxy read the metadata for large datasets directly from the file system rather than from the database. Is there a convenient API boundary where I could write an adapter to the dataset object interface?

In the meantime, I am going to try to just import the data using the link option. It's great that this feature is in already. However, when I import a couple of modest megabyte-size datasets using the "Link to files without copying to Galaxy" option, the status never changes from "queued". Is this a bug? Is there a known workaround? I have many large datasets.

Also, it takes a long time to expand the dataset name link. (My experiment on import is a data tree of about a thousand files.) Is this a known bug?

Thanks!
Ted
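For context, here is a minimal standalone sketch of the kind of filesystem-backed metadata lookup described above, using only the Python standard library. The scan_library() helper and the example directory are hypothetical illustrations, not part of Galaxy's actual dataset interface.

import os
import time

def scan_library(root_dir):
    """Walk a directory tree and collect basic dataset metadata
    straight from the filesystem, without touching a database.
    Hypothetical illustration only, not Galaxy code."""
    datasets = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            datasets.append({
                'name': name,
                'path': path,
                'size': st.st_size,
                'modified': time.ctime(st.st_mtime),
            })
    return datasets

# Example: list everything under a (hypothetical) library directory.
# for d in scan_library('/data/libraries/cancer'):
#     print(d['path'], d['size'])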
This bug irritated me, so I fixed it. Essentially, add_file() in upload.py is not in on the joke that local dirs are relative paths and need the absolute path tacked onto them. Is there a written process for how to submit the fix? I could not find one.

Thanks,
Ted

diff -r 21b645303c02 tools/data_source/upload.py
--- a/tools/data_source/upload.py	Thu Dec 22 13:54:33 2011 -0500
+++ b/tools/data_source/upload.py	Sat Dec 31 15:29:45 2011 -0800
@@ -74,7 +74,10 @@
         id, files_path, path = arg.split( ':', 2 )
         rval[int( id )] = ( path, files_path )
     return rval
-def add_file( dataset, registry, json_file, output_path ):
+
+import pdb
+
+def add_file( dataset, registry, json_file, output_path, root_dir):
     data_type = None
     line_count = None
     converted_path = None
@@ -94,7 +97,10 @@
             file_err( 'Unable to fetch %s\n%s' % ( dataset.path, str( e ) ), dataset, json_file )
             return
         dataset.path = temp_name
-    # See if we have an empty file
+
+    if dataset.type == 'server_dir' and not os.path.isabs( dataset.path):
+        dataset.path = os.path.join( root_dir, dataset.path )
+
     if not os.path.exists( dataset.path ):
         file_err( 'Uploaded temporary file (%s) does not exist.' % dataset.path, dataset, json_file )
         return
@@ -384,7 +390,7 @@
             files_path = output_paths[int( dataset.dataset_id )][1]
             add_composite_file( dataset, registry, json_file, output_path, files_path )
         else:
-            add_file( dataset, registry, json_file, output_path )
+            add_file( dataset, registry, json_file, output_path , sys.argv[1])
     # clean up paramfile
     try:
         os.remove( sys.argv[3] )
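For illustration, the core of the patch above boils down to the path normalization sketched below. This is a standalone sketch using only the standard library; the names resolve_server_dir_path, dataset_path, and the example paths are illustrative assumptions, not Galaxy's actual call signature.

import os

def resolve_server_dir_path(dataset_path, root_dir):
    """If a server_dir upload handed us a relative path, anchor it at the
    Galaxy root directory so os.path.exists() and friends can find it.
    Standalone sketch of the patch's logic; names are illustrative."""
    if not os.path.isabs(dataset_path):
        return os.path.join(root_dir, dataset_path)
    return dataset_path

# Example: a relative library path becomes absolute,
# an already-absolute path is left alone.
print(resolve_server_dir_path('database/import/sample.bam',
                              '/home/galaxy/galaxy-central'))
print(resolve_server_dir_path('/abs/path/sample.bam',
                              '/home/galaxy/galaxy-central'))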
On Dec 31, 2011, at 6:38 PM, Ted Goldstein wrote:
This bug irritated me, so I fixed it. Essentially, add_file() in upload.py is not in on the joke that local dirs are relative paths and need the absolute path tacked onto them. Is there a written process for how to submit the fix? I could not find one.
Hi Ted,

Is this the case when library_import_dir in the config file is relative? I've always used an absolute path there. I suppose it couldn't hurt to make it absolute programmatically.

--nate
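For illustration, the programmatic normalization suggested here could look roughly like the standalone sketch below, applied wherever the config value is read. The absolutize_import_dir helper, the config_dict name, and the example values are assumptions for illustration, not Galaxy's actual config loader.

import os

def absolutize_import_dir(config_dict, root_dir):
    """If library_import_dir is set to a relative path in the config,
    rewrite it as an absolute path anchored at the Galaxy root.
    Illustrative sketch only; not Galaxy's actual config code."""
    import_dir = config_dict.get('library_import_dir')
    if import_dir and not os.path.isabs(import_dir):
        config_dict['library_import_dir'] = os.path.join(root_dir, import_dir)
    return config_dict

# Example usage with a hypothetical relative setting:
cfg = {'library_import_dir': 'database/import'}
print(absolutize_import_dir(cfg, '/home/galaxy/galaxy-central'))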
participants (2)
- Nate Coraor
- Ted Goldstein