James,

In addition to Ross's comments, consider that the base class of all Galaxy datasets (library datasets, etc.) is the DatasetInstance class, whose get_file_name() method leverages the file_name property of the Dataset class (see ~/lib/galaxy/model/__init__.py). This file_name property is the pointer to the disk file for all Galaxy datasets that are opened for reading. At first glance I don't see a lot of calls that open these files in the Galaxy library framework, but it may still be problematic to handle even a few; I believe most of these calls are in the Galaxy job and job-related components (metadata setting, etc.). As Ross suggests, it may be a better approach to use the Galaxy API to translate your db queries into actual Galaxy data library dataset files, if that would work for you.
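A much-simplified sketch of that delegation (illustrative only, not the actual Galaxy source - the path layout and class contents here are assumptions):

# Illustrative only: a stripped-down model of the relationship described
# above, where DatasetInstance.get_file_name() defers to the Dataset's
# file_name property, which points at the file on disk. Any database-backed
# storage scheme would have to hook in at (or below) this property so that
# callers still receive a readable path.
import os


class Dataset(object):
    file_path = "/galaxy/database/files"  # assumed default file store location

    def __init__(self, dataset_id):
        self.id = dataset_id

    @property
    def file_name(self):
        # Hypothetical naming scheme; the real store computes this differently.
        return os.path.join(self.file_path, "dataset_%d.dat" % self.id)


class DatasetInstance(object):
    def __init__(self, dataset):
        self.dataset = dataset

    def get_file_name(self):
        return self.dataset.file_name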
On Nov 8, 2011, at 10:16 PM, Ross wrote:

Hi James,
Existing tools mostly take file paths. This has some arguably useful side effects: it isolates cluster-node execution from the Galaxy server process, and it persists the computation input on a file system, which is arguably more stable over the long term than queries against dynamic database tables. So changing this is maybe not somewhere you want or need to invest a lot of effort.
What you do depends on what your users need in order to make use of existing Galaxy tools - e.g. a specific experiment's sequence data in (say) fastq format? Take a look under scripts/api - maybe you could write scripted queries into Galaxy libraries automagically. Alternatively, if you want users to script their own extractions, it's not hard to write a new Galaxy tool that writes a new fastq file (for example) based on a database query with parameters supplied by the user. That fastq file appears in a history, where it is then available to all existing Galaxy fastq tools and is a shareable, persistent object for replicable analysis.
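A rough sketch of what the script behind such a tool could look like, assuming a sqlite3 backend and a hypothetical reads(read_id, sequence, quality, experiment_id) table - the real query and schema would of course be yours:

#!/usr/bin/env python
# Hypothetical sketch only: dump reads selected by a user-supplied
# experiment id out of a database as a fastq file. Galaxy would pass
# the arguments in via the tool's <command> template, e.g.:
#   db_to_fastq.py $db_path $experiment_id $output
import sqlite3
import sys


def export_fastq(db_path, experiment_id, out_path):
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "SELECT read_id, sequence, quality FROM reads WHERE experiment_id = ?",
            (experiment_id,),
        )
        with open(out_path, "w") as out:
            for read_id, sequence, quality in cursor:
                out.write("@%s\n%s\n+\n%s\n" % (read_id, sequence, quality))
    finally:
        conn.close()


if __name__ == "__main__":
    export_fastq(sys.argv[1], sys.argv[2], sys.argv[3])

A small tool config (XML) would then feed the database path, the user-supplied experiment id and the output dataset path into this script and declare the output format as fastq, so the result lands in a history like any other dataset.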
On Tue, Nov 8, 2011 at 9:39 PM, James Ireland <jireland@5amsolutions.com> wrote:
Hi Greg,
So, here are my concerns:

1. From looking through some of the source, it *appears* to me that the raw data input calls are spread across the various libraries as standard file I/O calls. So, if I wanted to use my db underneath, I'd need to replace/catch all of these. I was hoping that there would be fewer points of customization required.
2. Even if I solve (1), when tools like SAMtools, ClustalW, etc. are called from Galaxy, I assume that behind the scenes these applications are being passed file paths - I know that's how I've wrapped my own tools in Galaxy. So I would need to instantiate my data to a file at that point, which means adding some more special sauce to catch whenever a file path is about to be passed out to a tool and make sure the file gets created first (something like the sketch below).
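Roughly, I imagine the materialize-then-call step looking something like this (fetch_records() and the record format are placeholders):

# Rough sketch, not Galaxy code: dump database-backed data to a temporary
# file so that an existing command-line tool, which only understands file
# paths, can be handed something real to read.
import subprocess
import tempfile


def run_tool_on_db_data(fetch_records, tool_argv):
    """Write the records to a temp file and append its path to the tool command."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".dat", delete=False) as tmp:
        for line in fetch_records():
            tmp.write(line)
        input_path = tmp.name
    # The wrapped tool only ever sees a plain file path, just as it would
    # in a stock Galaxy install.
    subprocess.check_call(tool_argv + [input_path])
    return input_path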
My HUGE caveat is that I still haven't spent much time with the source so I could be way off on these concerns - but this is my impression. I'd welcome enlightenment if I'm wrong!
Thanks, -J
On Tue, Nov 8, 2011 at 3:13 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hi James,

I haven't gone too far down the implementation path in this area, so I'm certainly not aware of the issues you may be discovering. The key would be to implement a layer on top of your database so that Galaxy's data library upload component can treat the data contained in your database just like it treats the content of a file on the file system. Since Galaxy must open and read data files stored on the file system in order to use them as input to Galaxy tools, it should be able to do the same for data made available from a database table (I would assume, but again, I'm not completely sure of the potential issues). The data files resulting from the execution of these Galaxy tools would, of course, be files on the file system within Galaxy's default file store.

By "external tools" do you mean tools that are not part of the Galaxy instance?
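One speculative shape for that kind of layer (not existing Galaxy code - the sqlite3 backend, table and column layout below are placeholders) would be a read-only, file-like wrapper over a query result:

# Speculative sketch: a read-only, file-like object that streams rows out
# of a database so code expecting file.read() semantics can consume
# database content. Backend, query and row formatting are assumptions.
import sqlite3


class DbBackedFile(object):
    def __init__(self, db_path, query, params=()):
        self._conn = sqlite3.connect(db_path)
        self._cursor = self._conn.execute(query, params)
        self._buffer = ""

    def read(self, size=-1):
        # Pull rows until the buffer satisfies the request (or is exhausted).
        while size < 0 or len(self._buffer) < size:
            row = self._cursor.fetchone()
            if row is None:
                break
            self._buffer += "\t".join(str(value) for value in row) + "\n"
        if size < 0:
            data, self._buffer = self._buffer, ""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def close(self):
        self._conn.close()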
On Nov 8, 2011, at 5:14 PM, James Ireland wrote:
Hi Greg,
I did some more digging around in the Galaxy source today, and maybe I misjudged the situation. Although getting a representation of my datasets into Galaxy appears relatively straightforward, at the end of the day the reading of raw data, the passing of data to and from external tools, etc. all assume the data is sitting in a file, correct?
Thanks again, -J
On Mon, Nov 7, 2011 at 6:29 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hi James,

Since genomic data files are often very large, Galaxy does not store them in a database, so this specific scenario has not been implemented as far as I know. However, you may be able to implement what you've described without too much difficulty. If you could implement a layer on top of your database that leverages Galaxy's features for uploading a directory of files or file system paths (probably better in this case) without copying the data into Galaxy's default file store, it should be fairly trivial to make Galaxy work with it. Using this combination, Galaxy will read the data (without making any changes to it) in order to generate the metadata associated with it; the metadata is stored separately from the raw data.

I was at the PacBio meeting, so we definitely met there. Good to hear from you!

On Nov 7, 2011, at 8:58 PM, James Ireland wrote:
Hi Greg,
Thanks for the fast response! I think we might have met last year at the PacBio 3rd party software vendor meeting.
So, I had seen the documentation for the data repository, and the "Uploading a Directory of Files" option with "Copy data into Galaxy?" de-selected seems the closest analog to what I want to do. In my completely and utterly naive understanding of how Galaxy works, if I could wrap my data repository (in this case, my db) with the same sort of functionality as a file directory (scan, load, etc. - roughly along the lines sketched below), then I would guess that the integration wouldn't be that painful. Obviously, this would require custom development. This is important enough to my company that we'd be willing to work on doing it - but I'm guessing I'm way off base?
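For example, something shaped roughly like this (hypothetical names throughout - query_record_sets() and export_records() would be whatever my schema dictates), staging exports into a directory that Galaxy could then link to without copying:

# Hypothetical sketch: present database contents as a directory of files
# that Galaxy's "Upload a directory of files" / filesystem-paths import
# (with "Copy data into Galaxy?" de-selected) can link to in place.
import os


class DbRepositoryExporter(object):
    def __init__(self, staging_dir, query_record_sets, export_records):
        self.staging_dir = staging_dir
        self.query_record_sets = query_record_sets  # yields (name, records)
        self.export_records = export_records        # writes records to a path

    def scan(self):
        """List the dataset names currently available in the database."""
        return [name for name, _ in self.query_record_sets()]

    def load(self):
        """Materialize each record set as a file and return the paths."""
        if not os.path.isdir(self.staging_dir):
            os.makedirs(self.staging_dir)
        paths = []
        for name, records in self.query_record_sets():
            path = os.path.join(self.staging_dir, "%s.fastq" % name)
            self.export_records(records, path)
            paths.append(path)
        return paths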
This seems like it would be a fairly common request - to your knowledge, has anyone outside Galaxy rolled their own solution along these lines?
Thanks again, -J
On Mon, Nov 7, 2011 at 11:41 AM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hello James,

This is not currently possible - the options for uploading files to Galaxy data libraries are documented in our wiki at http://wiki.g2.bx.psu.edu/Admin/Data%20Libraries/Uploading%20Library%20Files

On Nov 7, 2011, at 2:11 PM, James Ireland wrote:
Greetings!
I would like to expose data I have in a relational database as a data library in Galaxy. I would really like to do this without Galaxy having to make a local copy of the data to the file system. Is this possible and could you point me to any code examples and/or documentation?
I'm sure this must be covered somewhere in the documentation or mailing list, but I haven't been able to find it.
Thanks for your help!
-James

--
J Ireland
www.5amsolutions.com | Software for Life(TM)
m: 415 484-DATA (3282)
Greg Von Kuster
Galaxy Development Team
greg@bx.psu.edu