How we found the filenames of all of the Galaxy datasets
Hi. During IT housekeeping, we have decided that we must know the filenames of all of the galaxy datasets in our local deployment. Searching the docs and the web yielded no reasonable answer, so we figured we'll just write our own tool. At first, we thought about a Galaxy tool, but found no way to extract all the datasets of all histories. So we went deeper. To the SQL itself. Without further ado, I give you the query: SELECT substring(to_char(history_dataset_association.dataset_id, '000000') from 2 for 3) || '/dataset_' || history_dataset_association.dataset_id || '.dat' , history_dataset_association.create_time, history_dataset_association.name, history_dataset_association.extension, history.name AS history_name,galaxy_user.username FROM history_dataset_association JOIN history ON history_dataset_association.history_id = history.id JOIN galaxy_user ON history.user_id = galaxy_user.id WHERE history_dataset_association.deleted = false AND history_dataset_association.purged = false ORDER BY username, history_name, create_time; We use PostgreSQL, so to run this you save the query as `get_filenames.sql`, and run: psql -U galaxy galaxy_prod -f get_filenames.sql -A > galaxy_export_filenames.csv This prints out a line structured as: 003/dataset_3748.dat|2012-10-09 12:07:37.786751| index.fastq|fastq|nan|sean Which means: folder/filename | time of creation | name as appears in history | format | history name | history owner Only files that were not deleted in galaxy are printed (you can remove the "WHERE" line if you want them all). By the way, one of the main struggles was finding the folder/filename. We have searched the database repeatedly for a filename field to no avail, until someone recognised the fact that they can be derived from the dataset_id. If you zero pad the id to six chars, the first three chars would be your folder name. The whole id is the file name. e.g.: dataset_id = 12345 folder = 012 filename = dataset_12345.dat Yaron. Search engine candies: export migrate files datasets filenames on disk
Hi Yaron This is useful - Thank you very much. Coming up with such a sql statement to identify the files from people who have left our institute has been on my list of things to do for quite some time. Now I just need to figure how to distinguish the files which have been shared with people who are still here. Regards, Hans-Rudolf PS: for those of us who got stuck with a MySQL database, this is the corresponding select statement: SELECT CONCAT(substring(lpad(history_dataset_association.dataset_id, 6, 000000) from 1 for 3), '/dataset_', history_dataset_association.dataset_id, '.dat') AS file, .... On 10/30/2013 10:00 AM, Yaron de Leeuw wrote:
Hi.
During IT housekeeping, we have decided that we must know the filenames of all of the galaxy datasets in our local deployment. Searching the docs and the web yielded no reasonable answer, so we figured we'll just write our own tool. At first, we thought about a Galaxy tool, but found no way to extract all the datasets of all histories. So we went deeper. To the SQL itself.
Without further ado, I give you the query:
SELECT substring(to_char(history_dataset_association.dataset_id, '000000') from 2 for 3) || '/dataset_' || history_dataset_association.dataset_id || '.dat' , history_dataset_association.create_time, history_dataset_association.name, history_dataset_association.extension, history.name AS history_name,galaxy_user.username FROM history_dataset_association JOIN history ON history_dataset_association.history_id = history.id JOIN galaxy_user ON history.user_id = galaxy_user.id WHERE history_dataset_association.deleted = false AND history_dataset_association.purged = false ORDER BY username, history_name, create_time;
We use PostgreSQL, so to run this you save the query as `get_filenames.sql`, and run:
psql -U galaxy galaxy_prod -f get_filenames.sql -A > galaxy_export_filenames.csv
This prints out a line structured as: 003/dataset_3748.dat|2012-10-09 12:07:37.786751| index.fastq|fastq|nan|sean
Which means: folder/filename | time of creation | name as appears in history | format | history name | history owner
Only files that were not deleted in galaxy are printed (you can remove the "WHERE" line if you want them all).
By the way, one of the main struggles was finding the folder/filename. We have searched the database repeatedly for a filename field to no avail, until someone recognised the fact that they can be derived from the dataset_id. If you zero pad the id to six chars, the first three chars would be your folder name. The whole id is the file name. e.g.: dataset_id = 12345 folder = 012 filename = dataset_12345.dat
Yaron.
Search engine candies: export migrate files datasets filenames on disk
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
participants (2)
-
Hans-Rudolf Hotz
-
Yaron de Leeuw