New subject: How we found the filenames of all of the Galaxy datasets

30 Oct 2013

      Hi.

During IT housekeeping, we have decided that we must know the filenames
of all of the galaxy datasets in our local deployment. Searching the docs
and
the web yielded no reasonable answer, so we figured we'll just write our
own tool.
At first, we thought about a Galaxy tool, but found no way to extract all
the
datasets of all histories. So we went deeper. To the SQL itself.

Without further ado, I give you the query:

    SELECT substring(to_char(history_dataset_association.dataset_id,
'000000') from 2 for 3) ||  '/dataset_' ||
history_dataset_association.dataset_id || '.dat' ,
history_dataset_association.create_time, history_dataset_association.name,
history_dataset_association.extension, history.name AS
history_name,galaxy_user.username
    FROM history_dataset_association
    JOIN history ON history_dataset_association.history_id = history.id
    JOIN galaxy_user ON history.user_id = galaxy_user.id
    WHERE history_dataset_association.deleted = false AND
history_dataset_association.purged = false
    ORDER BY  username, history_name, create_time;

We use PostgreSQL, so to run this you save the query as
`get_filenames.sql`, and run:

    psql -U galaxy galaxy_prod -f get_filenames.sql -A >
galaxy_export_filenames.csv

This prints out a line structured as:
003/dataset_3748.dat|2012-10-09 12:07:37.786751| index.fastq|fastq|nan|sean

Which means:
  folder/filename  |  time of creation  |  name as appears in history  |
format  |  history name |  history owner

Only files that were not deleted in galaxy are printed (you can remove the
"WHERE"
line if you want them all).

By the way, one of the main struggles was finding the folder/filename. We
have searched
the database repeatedly for a filename field to no avail, until someone
recognised
the fact that they can be derived from the dataset_id. If you zero pad the
id to six chars,
the first three chars would be your folder name. The whole id is the file
name. e.g.:
dataset_id = 12345
folder = 012
filename = dataset_12345.dat

Yaron.

Search engine candies:   export migrate files datasets filenames on disk

How we found the filenames of all of the Galaxy datasets

Yaron de Leeuw

Hans-Rudolf Hotz

tags

participants (2)