Creating multiple datasets in a libset
I'm looking at example_watch_folder.py and it's not clear from the example how you submit multiple datasets to a library. In the example, the first submit returns a libset [] with only a single entry, and the script then proceeds to iterate through each dataset in the libset in the following section:

data = {}
data['folder_id'] = library_folder_id
data['file_type'] = 'auto'
data['dbkey'] = ''
data['upload_option'] = 'upload_paths'
data['filesystem_paths'] = fullpath
data['create_type'] = 'file'
libset = submit(api_key, api_url + "libraries/%s/contents" % library_id, data, return_formatted=False)
time.sleep(5)
for ds in libset:
    if 'id' in ds:
        wf_data = {}
        wf_data['workflow_id'] = workflow['id']
        wf_data['history'] = "%s - %s" % (fname, workflow['name'])
        wf_data['ds_map'] = {}
        for step_id, ds_in in workflow['inputs'].iteritems():
            wf_data['ds_map'][step_id] = {'src': 'ld', 'id': ds['id']}
        res = submit(api_key, api_url + 'workflows', wf_data, return_formatted=False)

Rob Leclerc, PhD
http://www.linkedin.com/in/robleclerc | https://twitter.com/#!/robleclerc
P: (US) +1-(917)-873-3037
P: (Shanghai) +86-1-(861)-612-5469
Personal Email: rob.leclerc@aya.yale.edu
Hey Rob,

That example_watch_folder.py submits exactly one dataset at a time, executes the workflow, and then does the next, all in separate transactions. If you wanted to upload multiple filepaths at once, you'd just append more paths to the 'filesystem_paths' field (newline separated paths).

-Dannon
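A rough sketch of what Dannon describes, reusing the submit helper and the library_folder_id/library_id variables from example_watch_folder.py (the file paths themselves are just illustrative): join the paths with newlines and make a single submission.

import time

# Files to load into the library in one request (illustrative paths)
paths = ['/home/me/file1.vcf', '/home/me/file2.vcf', '/home/me/file3.vcf']

data = {}
data['folder_id'] = library_folder_id
data['file_type'] = 'auto'
data['dbkey'] = ''
data['upload_option'] = 'upload_paths'
data['filesystem_paths'] = "\n".join(paths)  # one path per line
data['create_type'] = 'file'

# A single submit call should now return a libset with one entry per path
libset = submit(api_key, api_url + "libraries/%s/contents" % library_id,
                data, return_formatted=False)

The rest of the loop from the example should then see one dataset dictionary per uploaded file.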
Hi Dannon,

Thanks for the response. Sorry to be pedantic, but just to make sure that I understand the interpretation of this field on the other side of the API, I would need to have something like the following:

data['filesystem_paths'] = "/home/me/file1.vcf \n /home/me/file2.vcf \n /home/me/file3.vcf"

I assume I should also increase the time.sleep() to reflect the uploading of extra files?

Cheers,
Rob
Yep, that example filesystem_paths you suggest should work fine. The sleep() bit was a complete hack from the start, for simplicity in demonstrating a very basic pipeline. For a real implementation, instead of relying on sleep, what you probably want to do is query the dataset in question via the API, verify that the datatype etc. have been set, and only then execute the workflow.

-Dannon
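A minimal sketch of that polling idea, reusing the display helper from the API example scripts. The 'data_type' field checked here is the one Dannon points at later in the thread; exactly what a library dataset reports (and its default value) depends on your Galaxy version, so treat the condition as an assumption to adjust.

import time

def wait_for_library_dataset(api_key, api_url, library_id, ds_id, poll=3):
    # Poll a library dataset until Galaxy reports a detected datatype.
    # Assumes libraries/<id>/contents/<id> returns a 'data_type' field;
    # adjust the check to whatever your Galaxy version actually returns.
    url = api_url + 'libraries/%s/contents/%s' % (library_id, ds_id)
    while True:
        info = display(api_key, url, return_formatted=False)
        if info.get('data_type') not in (None, '', 'auto'):
            return info
        time.sleep(poll)

Calling this for each ds in libset before building wf_data would replace the blanket time.sleep(5).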
Hi Dannon,

I've written some code to (i) query a dataset to ensure that it's been uploaded after a submit and (ii) ensure a resulting dataset has been written to the filesystem.

# Block until all datasets have been uploaded
libset = submit(api_key, api_url + "libraries/%s/contents" % library_id, data, return_formatted=False)
for ds in libset:
    while True:
        uploaded_file = display(api_key, api_url + 'libraries/%s/contents/%s' % (library_id, ds['id']), return_formatted=False)
        if uploaded_file['misc_info'] is None:
            time.sleep(1)
        else:
            break

# Block until all result datasets have been saved to the filesystem
result_ds_url = api_url + 'histories/' + history_id + '/contents/' + dsh['id']
while True:
    result_ds = display(api_key, result_ds_url, return_formatted=False)
    if result_ds["state"] == 'ok':
        break
    else:
        time.sleep(1)

Rob
Correction: the above were not reliable methods to ensure the file was copied into the data library. Checking for file_size != 0 was also not effective for large files.

Dannon, can you tell me which field we should query, and what state/message will allow us to avoid race conditions?

The only solution I can see is to wait until file_size != 0 and then make sure the file_size has not changed after a short delay.

Cheers,
Rob
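For reference, Rob's proposed stopgap would look roughly like this. It is a heuristic sketch only: 'file_size' is the field already used above, the delay is arbitrary, and Dannon's reply below suggests checking the detected datatype instead.

import time

def wait_for_stable_size(api_key, url, delay=5):
    # Heuristic: consider the upload done once file_size is non-zero
    # and has stopped growing between two polls 'delay' seconds apart.
    last = -1
    while True:
        info = display(api_key, url, return_formatted=False)
        size = info.get('file_size') or 0
        if size != 0 and size == last:
            return info
        last = size
        time.sleep(delay)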
Hey Rob,

I haven't touched this in quite some time, but the comments indicate that the sleep was there because the datatype hadn't been detected and set yet, so that's what I'd check for -- specifically result_ds['data_type'] being something sensible (I have no idea what it is by default off the top of my head). Let me know if this doesn't work and I can take a closer look.

-Dannon
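Putting Dannon's data_type suggestion together with the state check from Rob's earlier snippet, a wait loop for a history output dataset might look like the following. The 'auto'/empty defaults checked for data_type are assumptions (Dannon says he is not sure what the default is), so verify them against your Galaxy instance.

import time

def wait_for_result_dataset(api_key, result_ds_url, poll=2):
    # Poll a history dataset until its state is 'ok' and Galaxy has
    # assigned a real datatype, then return the dataset dictionary.
    while True:
        result_ds = display(api_key, result_ds_url, return_formatted=False)
        state_ok = result_ds.get('state') == 'ok'
        type_set = result_ds.get('data_type') not in (None, '', 'auto')
        if state_ok and type_set:
            return result_ds
        time.sleep(poll)

Example use:

result_ds = wait_for_result_dataset(api_key, api_url + 'histories/' + history_id + '/contents/' + dsh['id'])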
participants (2)
- Dannon Baker
- Rob Leclerc