Programmatically deleting data sets from data libraries?
Hi, Galaxy Developers,

I apologize for resurrecting another old thread (http://dev.list.galaxyproject.org/delete-data-library-via-API-td4553000.html), and for this long-winded email.

First things first: I am trying to confirm what is suggested in the thread cited above, namely that data sets cannot be deleted from a Galaxy data library via the API. I have a vested interest in performing this operation, to the extent that I've started investigating modifying the PostgreSQL tables directly if it can't be done via the API (more about this later).

The reason I want to programmatically delete data from a data library is that I am writing custom Python code that keeps a folder on the local filesystem consistent with a corresponding Galaxy data library (i.e. bi-directional synchronization). I need to do this in an automated fashion to account for the following two conditions:

1) The file on the filesystem gets deleted (i.e. the path that is referenced in the data library is no longer valid).
2) The MD5 of the file on the filesystem changes (i.e. the file was replaced or modified, and needs to be re-imported so that the correct metadata (e.g. file size) is reported via the Galaxy UI).

Based on the limited amount of testing I have done, it doesn't appear to be possible to delete an individual data set from a data library via the API. Here is the test that leads me to believe this:

1) I can delete a data library without issue. Here is the output of me doing so:

--(galaxy@crigalaxy)-(/group/galaxy/galaxy-dist/scripts/api)--
./delete.py 11f3cb91acb2ab1677f8265bxxxxxxxx http://localhost:8081/api/libraries/e85a3be143d5905b
Response
{'synopsis': 'dansully', 'description': 'dansully', 'name': 'dansully'}
--(galaxy@crigalaxy)-(/group/galaxy/galaxy-dist/scripts/api)--

2) Whenever I try to delete an item in the data library, I get a 404 with the response "No action for ...". Displaying the item works, but deleting it does not:

--(galaxy@crigalaxy)-(/group/galaxy/galaxy-dist/scripts/api)--
./display.py 11f3cb91acb2ab1677f8265bxxxxxxxx http://localhost:8081/api/libraries/e85a3be143d5905b/contents/62e564808c5368...
Member Information
ldda_id: 62e564808c5368d4
misc_blurb: 2 lines
name: whatever.txt
data_type: txt
file_name: /group/galaxy/galaxy-dist/database/files/008/dataset_8938.dat
uploaded_by: dansully@uchicago.edu
template_data: {}
genome_build: ?
model_class: LibraryDataset
misc_info: uploaded txt file
file_size: 329
metadata_data_lines: 2
message:
id: 62e564808c5368d4
date_uploaded: 2012-08-29T15:48:38.335445
metadata_dbkey: ?
--(galaxy@crigalaxy)-(/group/galaxy/galaxy-dist/scripts/api)--
./delete.py 11f3cb91acb2ab1677f8265bxxxxxxxx http://localhost:8081/api/libraries/e85a3be143d5905b/contents/62e564808c5368...
HTTP Error 404: Not Found
404 Not Found
The resource could not be found.
No action for /api/libraries/e85a3be143d5905b/contents/62e564808c5368d4
--(galaxy@crigalaxy)-(/group/galaxy/galaxy-dist/scripts/api)--
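
To make the failing request concrete, here is a minimal Python 3 sketch of what a DELETE call against a library contents URL might look like. It is not the delete.py script from scripts/api: the function names build_delete_request and try_delete are hypothetical, and passing the API key as a "key" query parameter is an assumption about what the server expects. On an instance that has no DELETE route for library contents, it should simply report the same 404 seen in the transcript.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_delete_request(api_key, url):
    """Return a DELETE request for `url` with the API key appended as ?key=...

    Assumption: the server reads the API key from a `key` query parameter.
    """
    separator = "&" if "?" in url else "?"
    full_url = url + separator + urllib.parse.urlencode({"key": api_key})
    return urllib.request.Request(full_url, method="DELETE")


def try_delete(api_key, url):
    """Send the DELETE and return (status, body).

    HTTP errors are returned rather than raised, so a 404 "No action for ..."
    can be inspected by the caller instead of aborting a sync run.
    """
    request = build_delete_request(api_key, url)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


# Usage (not executed here; needs a running server and a full contents URL):
#   status, body = try_delete(api_key, library_contents_url)
#   if status == 404:
#       print("server has no DELETE action for this URL")
```

Returning the status instead of raising keeps the decision about what to do next (for example, falling back to another approach) in the calling code.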
My hope is that the API can still be used for this: either I am not forming the URL correctly, or the data{} dict accepts a key I am not aware of that would make deleting an individual data set feasible. (It is my understanding that the Galaxy API is still under active development; so far I have not been able to locate any documentation of the available keys or attributes, other than the example code that is distributed.) Would it be possible for somebody with specific knowledge of the Galaxy API to comment on whether or not this functionality is implemented?

All of this being said, if it is *not* possible to delete an individual data set via the Galaxy API, I am prepared to make a 'reasonable' effort to do this by modifying the Galaxy back-end SQL tables directly. Based on the research I have done (I have enabled 'mod' statement logging on the PostgreSQL database), here is what a delete operation from the Galaxy UI looks like in terms of database changes:

2012-08-29 08:44:20.361 CDT,"galaxy_","galaxy",7952,"127.0.0.1:46174",503e1a8d.1f10,41,"idle in transaction",2012-08-29 ent: UPDATE library_dataset SET update_time='2012-08-29T13:44:20.361070', deleted=true WHERE library_dataset.id = 6013",$

So, at this point (assuming that I cannot delete the data set via the Galaxy API), I'm trying to work backwards from information I know (i.e. file_name or ldda_id) to determine the library_dataset.id. I'm still digging through the database query logs to work out how this is done; it appears that more than one query is executed when a data library is rendered via the Galaxy UI. Determining the id of the data set in question continues to be a challenge.

Which leads me to one final question. Could anybody tell me how the IDs (the ldda_id) that get returned by the Galaxy API are calculated? Is it some sort of hash of composite or primary keys from the back-end tables? The reason I am asking is that I did a full dump of the database and searched (using grep) for the ldda_id (i.e. 62e564808c5368d4), and it didn't exist anywhere in the database (I was surprised by this).

If anybody out there has programmatically deleted a data set from a Galaxy data library (via the API or otherwise), or could shed some light on how to solve my problem, I'd love to hear from you. Thank you so much for your time, and again, I apologize for my lengthy e-mail.

Dan Sullivan
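
For the database route described in this message, the following Python sketch shows one possible way to work backwards from file_name to a library_dataset row and then apply the same soft delete that appears in the logged UPDATE. Everything about the schema here is an assumption for illustration: the two tables are cut down to a few columns, the library_dataset_id and dataset_id columns on library_dataset_dataset_association are assumed names, and reading the dataset id out of a dataset_<id>.dat file name is an inference from the file_name shown earlier. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs on its own, and the demo rows are invented.

```python
import re
import sqlite3

# Assumed join path: dataset id -> association row -> library_dataset row.
LOOKUP_SQL = """
SELECT ld.id
FROM library_dataset AS ld
JOIN library_dataset_dataset_association AS ldda
  ON ldda.library_dataset_id = ld.id
WHERE ldda.dataset_id = ?
"""

# Mirrors the logged statement; SQLite stores the boolean as 1.
SOFT_DELETE_SQL = "UPDATE library_dataset SET update_time = ?, deleted = 1 WHERE id = ?"


def dataset_id_from_file_name(file_name):
    """Extract the number from a path ending in dataset_<id>.dat, or None."""
    match = re.search(r"dataset_(\d+)\.dat$", file_name)
    return int(match.group(1)) if match else None


def soft_delete_by_file_name(conn, file_name, now):
    """Mark matching library_dataset rows as deleted; return their ids."""
    dataset_id = dataset_id_from_file_name(file_name)
    if dataset_id is None:
        return []
    ids = [row[0] for row in conn.execute(LOOKUP_SQL, (dataset_id,))]
    for library_dataset_id in ids:
        conn.execute(SOFT_DELETE_SQL, (now, library_dataset_id))
    conn.commit()
    return ids


def make_demo_db():
    """Build a tiny in-memory database with invented rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE library_dataset (
            id INTEGER PRIMARY KEY, update_time TEXT, deleted INTEGER DEFAULT 0);
        CREATE TABLE library_dataset_dataset_association (
            id INTEGER PRIMARY KEY, library_dataset_id INTEGER, dataset_id INTEGER);
        INSERT INTO library_dataset (id) VALUES (42);
        INSERT INTO library_dataset_dataset_association
            (id, library_dataset_id, dataset_id) VALUES (1, 42, 8938);
    """)
    return conn
```

The sketch deliberately works from file_name rather than from the encoded ldda_id. Before running anything like this against a real Galaxy database, it would be worth checking the actual table definitions and taking a backup, since the UI may update more than the one row visible in the log excerpt.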
Hi Dan,

I had a similar problem just yesterday. I couldn't enter the 'Data library' via the API anymore because one of the dataset files wasn't available anymore (or actually it was, but someone had a space in the filename...). So, to be able to enter the 'Data library' again, I had to remove the datasets.

I guess there might be a better and nicer way, but at some point I went the hard road and removed the entries in the dataset and library_dataset tables (and some others) directly.

Long story short: I cannot really help here :) But what I can say is that the ldda_id is the id of the entry in the library_dataset_dataset_association table (or that's what I expect it to be).

Hope it helps a bit!

Cheers,
Sajoscha

On Sep 4, 2012, at 5:52 PM, Dan Sullivan wrote:
[...]
Participants (2): Dan Sullivan, Sajoscha Sauer