James,

In addition to Ross's comments, consider that the base class of all Galaxy datasets (library datasets, etc.) is the DatasetInstance class, whose get_file_name() method leverages the file_name property of the Dataset class (see ~/lib/galaxy/model/__init__.py). This file_name property is the pointer to the disk file for all Galaxy datasets that are opened for reading. At first glance I don't see a lot of calls that open these files in the Galaxy library framework, but it may still be problematic to handle even a few; I believe most of these calls are in the Galaxy job and job-related components (metadata setting, etc.). As Ross suggests, it may be a better approach to use the Galaxy API to translate your db queries into actual Galaxy data library dataset files, if that would work for you.
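A much-simplified sketch of that delegation (illustrative only, not the actual Galaxy source - the path layout and class contents here are assumptions):

# Illustrative only: a stripped-down model of the relationship described
# above, where DatasetInstance.get_file_name() defers to the Dataset's
# file_name property, which points at the file on disk. Any database-backed
# storage scheme would have to hook in at (or below) this property so that
# callers still receive a readable path.
import os


class Dataset(object):
    file_path = "/galaxy/database/files"  # assumed default file store location

    def __init__(self, dataset_id):
        self.id = dataset_id

    @property
    def file_name(self):
        # Hypothetical naming scheme; the real store computes this differently.
        return os.path.join(self.file_path, "dataset_%d.dat" % self.id)


class DatasetInstance(object):
    def __init__(self, dataset):
        self.dataset = dataset

    def get_file_name(self):
        return self.dataset.file_name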
On Nov 8, 2011, at 10:16 PM, Ross wrote:

Hi James,
Existing tools mostly take file paths. This has some arguably useful side effects: it isolates cluster-node execution from the Galaxy server process, and it persists the computation input on a file system, which is arguably more stable over the long term than queries against dynamic database tables. So changing this is maybe not somewhere you want or need to invest a lot of effort.
What you do depends on what your users need in order to make use of existing Galaxy tools - e.g. a specific experiment's sequence data in (say) fastq format? Take a look under scripts/api - maybe you could write scripted queries into Galaxy libraries automagically. Alternatively, if you want users to script their own extractions, it's not hard to write a new Galaxy tool that writes a new fastq file (for example) based on a database query with parameters supplied by the user. That fastq file appears in a history, where it is then available to all existing Galaxy fastq tools and is a shareable, persistent object for replicable analysis.
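A rough sketch of what the script behind such a tool could look like, assuming a sqlite3 backend and a hypothetical reads(read_id, sequence, quality, experiment_id) table - the real query and schema would of course be yours:

#!/usr/bin/env python
# Hypothetical sketch only: dump reads selected by a user-supplied
# experiment id out of a database as a fastq file. Galaxy would pass
# the arguments in via the tool's <command> template, e.g.:
#   db_to_fastq.py $db_path $experiment_id $output
import sqlite3
import sys


def export_fastq(db_path, experiment_id, out_path):
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "SELECT read_id, sequence, quality FROM reads WHERE experiment_id = ?",
            (experiment_id,),
        )
        with open(out_path, "w") as out:
            for read_id, sequence, quality in cursor:
                out.write("@%s\n%s\n+\n%s\n" % (read_id, sequence, quality))
    finally:
        conn.close()


if __name__ == "__main__":
    export_fastq(sys.argv[1], sys.argv[2], sys.argv[3])

A small tool config (XML) would then feed the database path, the user-supplied experiment id and the output dataset path into this script and declare the output format as fastq, so the result lands in a history like any other dataset.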
On Tue, Nov 8, 2011 at 9:39 PM, James Ireland <jireland@5amsolutions.com> wrote:
Hi Greg,
So, here are my concerns:

1. From looking through some of the source, it *appears* to me that the raw data input calls are spread across the various libraries as standard file I/O calls. So, if I wanted to use my db underneath, I'd need to replace/catch all of these. I was hoping that there would be fewer points of customization required.
2. Even if I solve (1), when tools like SAMtools, ClustalW, etc. are called from Galaxy, I assume that behind the scenes these applications are being passed file paths - I know that's how I've wrapped my own tools in Galaxy. So I would need to instantiate my data to a file at that point, which means adding some more special sauce to catch whenever a file path is about to be passed out to a tool and make sure the file gets created first (something like the sketch below).
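Roughly, I imagine the materialize-then-call step looking something like this (fetch_records() and the record format are placeholders):

# Rough sketch, not Galaxy code: dump database-backed data to a temporary
# file so that an existing command-line tool, which only understands file
# paths, can be handed something real to read.
import subprocess
import tempfile


def run_tool_on_db_data(fetch_records, tool_argv):
    """Write the records to a temp file and append its path to the tool command."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".dat", delete=False) as tmp:
        for line in fetch_records():
            tmp.write(line)
        input_path = tmp.name
    # The wrapped tool only ever sees a plain file path, just as it would
    # in a stock Galaxy install.
    subprocess.check_call(tool_argv + [input_path])
    return input_path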
My HUGE caveat is that I still haven't spent much time with the source so I could be way off on these concerns - but this is my impression. I'd welcome enlightenment if I'm wrong!
Thanks, -J
On Tue, Nov 8, 2011 at 3:13 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hi James,

I haven't gone too far down the implementation path in this area, so I'm certainly not aware of the issues you may be discovering. The key would be to implement a layer on top of your database so that Galaxy's data library upload component can treat the data contained in your database just like it treats the content of a file on the file system. Since Galaxy must open and read data files stored on the file system in order to use them as input to Galaxy tools, it should be able to do the same for data made available from a database table (I would assume, but again, I'm not completely sure of the potential issues). The data files resulting from the execution of these Galaxy tools would, of course, be files on the file system within Galaxy's default file store.

By "external tools" do you mean tools that are not part of the Galaxy instance?
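One speculative shape for that kind of layer (not existing Galaxy code - the sqlite3 backend, table and column layout below are placeholders) would be a read-only, file-like wrapper over a query result:

# Speculative sketch: a read-only, file-like object that streams rows out
# of a database so code expecting file.read() semantics can consume
# database content. Backend, query and row formatting are assumptions.
import sqlite3


class DbBackedFile(object):
    def __init__(self, db_path, query, params=()):
        self._conn = sqlite3.connect(db_path)
        self._cursor = self._conn.execute(query, params)
        self._buffer = ""

    def read(self, size=-1):
        # Pull rows until the buffer satisfies the request (or is exhausted).
        while size < 0 or len(self._buffer) < size:
            row = self._cursor.fetchone()
            if row is None:
                break
            self._buffer += "\t".join(str(value) for value in row) + "\n"
        if size < 0:
            data, self._buffer = self._buffer, ""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def close(self):
        self._conn.close()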
On Nov 8, 2011, at 5:14 PM, James Ireland wrote:
Hi Greg,
I did some more digging around in the Galaxy source today, and maybe I misjudged the situation. Although getting a representation of my datasets into Galaxy appears relatively straightforward, at the end of the day the reading of raw data, the passing of data to and from external tools, etc. all assume the data is sitting in a file, correct?
Thanks again, -J
On Mon, Nov 7, 2011 at 6:29 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hi James,

Since genomic data files are often very large, Galaxy does not store them in a database, so this specific scenario has not been implemented as far as I know. However, you may be able to implement what you've described without too much difficulty. If you could implement a layer on top of your database that leverages Galaxy's features for uploading a directory of files or file system paths (probably better in this case) without copying the data into Galaxy's default file store, it should be fairly trivial to make Galaxy work with it. Using this combination, Galaxy will read the data (without making any changes to it) in order to generate the metadata associated with it; the metadata is stored separately from the raw data.

I was at the PacBio meeting, so we definitely met there. Good to hear from you!

On Nov 7, 2011, at 8:58 PM, James Ireland wrote:
Hi Greg,
Thanks for the fast response! I think we might have met last year at the PacBio 3rd party software vendor meeting.
So, I had seen the documentation for the data repository, and the "Uploading a Directory of Files" option with "Copy data into Galaxy?" de-selected seems the closest analog to what I want to do. In my completely and utterly naive understanding of how Galaxy works, if I could wrap my data repository (in this case, my db) with the same sort of functionality as a file directory (scan, load, etc. - roughly along the lines sketched below), then I would guess that the integration wouldn't be that painful. Obviously, this would require custom development. This is important enough to my company that we'd be willing to work on doing it - but I'm guessing I'm way off base?
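For example, something shaped roughly like this (hypothetical names throughout - query_record_sets() and export_records() would be whatever my schema dictates), staging exports into a directory that Galaxy could then link to without copying:

# Hypothetical sketch: present database contents as a directory of files
# that Galaxy's "Upload a directory of files" / filesystem-paths import
# (with "Copy data into Galaxy?" de-selected) can link to in place.
import os


class DbRepositoryExporter(object):
    def __init__(self, staging_dir, query_record_sets, export_records):
        self.staging_dir = staging_dir
        self.query_record_sets = query_record_sets  # yields (name, records)
        self.export_records = export_records        # writes records to a path

    def scan(self):
        """List the dataset names currently available in the database."""
        return [name for name, _ in self.query_record_sets()]

    def load(self):
        """Materialize each record set as a file and return the paths."""
        if not os.path.isdir(self.staging_dir):
            os.makedirs(self.staging_dir)
        paths = []
        for name, records in self.query_record_sets():
            path = os.path.join(self.staging_dir, "%s.fastq" % name)
            self.export_records(records, path)
            paths.append(path)
        return paths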
This seems like it would be a fairly common request - to your knowledge, has anyone outside Galaxy rolled their own solution along these lines?
Thanks again, -J
On Mon, Nov 7, 2011 at 11:41 AM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hello James,

This is not currently possible - the options for uploading files to Galaxy data libraries are documented in our wiki at http://wiki.g2.bx.psu.edu/Admin/Data%20Libraries/Uploading%20Library%20Files

On Nov 7, 2011, at 2:11 PM, James Ireland wrote:
Greetings!
I would like to expose data I have in a relational database as a data library in Galaxy. I would really like to do this without Galaxy having to make a local copy of the data to the file system. Is this possible and could you point me to any code examples and/or documentation?
I'm sure this must be covered somewhere in the documentation or mailing list, but I haven't been able to find it.
Thanks for your help!
-James

--
J Ireland
www.5amsolutions.com | Software for Life(TM)
m: 415 484-DATA (3282)
Greg Von Kuster
Galaxy Development Team
greg@bx.psu.edu