Hi,

There have been a few comments about how general we could make the system, whether for Galaxy use or as a stand-alone, command-line-driven tool. So, some notes below on what I could see it taking on. Given the scale of the sequencing data problem, I'm sure the Galaxy community has important feedback on this.

I looked at git annex, and it appears to me that although it promises to keep track of and synchronize network-located files, it doesn't do versioning on them - am I wrong about that? I also looked at https://code.google.com/p/leveldb/ , another key-value database, one that relies more heavily on indexes - but I see that although it is well tuned to answering key queries, it isn't particularly good at storing and retrieving entire versions of a database that could be many gigabytes in size, which is our mission.

It is relatively easy to generalize the simple keydb prototype I wrote so that it can handle any key-value database - including binary content and even binary key data, not just text (fasta sequences). So a name change for the tool is a good idea.

I want a versioning system that doesn't assume the incoming master file of key-value pairs is in the same order as it was on a previous import run. I was afraid that any arbitrary change in the order of content on the source server could completely destroy the efficiency of a differential approach. Git assumes its content is like a document, so it generates a slew of inserts and deletes - and in fact provides no benefit - if the fasta entries are rearranged. I tested helping git overcome this hurdle by converting the fasta content to 1-line key/value fasta entries and sorting them before git processing. That seemed to work for some smaller and larger nucleotide fasta files (tested from 10 MB to 2 GB) but failed when it came to processing protein fasta files, though possibly that was because of the fasta data line length. That became another concern - thinking that git was failing because each line of the input file was many thousands of characters long. So, having written a "keydb" versioning engine that works and performs as well as git, I am definitely shying away from git now as unreliable on certain kinds of data.

The keydb approach is able to generate a version file at about the same speed it takes to read the latest version of the same db, i.e. at about 50 MB/s on a standard hard drive. An extension to keydb that lets it take in just a list of adds, deletes or updates is desirable, but that can come later. More efficiency could be had by fine-tuning the updates so that a whole key-value line doesn't have to replace the previous one, but that's for later too.

A note on generalization: the keydb approach works wherever the keys form a sparse array. There's nothing stopping the keys from representing a 2D or 3D sparse array of data, as long as the coordinates are coded uniquely into the one key list.

For those interested in versioning XML data, there is an interesting summary of the challenges here: http://useless-factor.blogspot.ca/2008/01/matching-diffing-and-merging-xml.h... . It leaves me thinking that quick versioning of XML data could only be accomplished if it could somehow be converted into a key-value db, i.e. with each top-level XML record identified by a unique key.

I could also see breaking larger keydb databases up into smaller chunks for data retrieval and fast parallel processing - the usual approach being to separate the sorted key-value db out into files based on the first character or two in the key of each record.
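To make the ordering problem concrete, here is a rough sketch of the normalization step - not the keydb code itself, just an illustration - that flattens a fasta file into sorted one-line key/value records, so that two imports compare identically no matter what order the source server emitted the records in. The function and file names are made up for the example, and for multi-gigabyte files you would sort on disk rather than in memory.

```python
# Hypothetical sketch: flatten fasta records to sorted "key<TAB>sequence"
# lines so that record order on the source server no longer matters.
# For very large files an external sort (e.g. GNU sort) would replace the
# in-memory dict used here.
import sys

def fasta_to_sorted_keyvalue(fasta_path, out_path):
    records = {}
    key = None
    chunks = []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if key is not None:
                    records[key] = "".join(chunks)
                header = line[1:].strip()
                key = header.split()[0] if header else ""  # fasta id as key
                chunks = []
            elif key is not None:
                chunks.append(line)
        if key is not None:
            records[key] = "".join(chunks)

    with open(out_path, "w") as out:
        for key in sorted(records):  # canonical order = sorted keys
            out.write(f"{key}\t{records[key]}\n")

if __name__ == "__main__":
    fasta_to_sorted_keyvalue(sys.argv[1], sys.argv[2])
```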
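And this is roughly how I picture the differential step - again only a sketch, not the actual keydb implementation: walk the old and new sorted key/value dumps in a single streaming pass and emit a version file of adds, deletes and updates. The "A"/"D"/"U" line prefixes are invented for the example; the single pass is what keeps the speed close to the raw read speed of the master file.

```python
# Hypothetical sketch of the differential step: compare an old and a new
# sorted "key<TAB>value" dump in one streaming pass and write a version
# file of adds (A), deletes (D) and updates (U).

def read_record(handle):
    line = handle.readline()
    if not line:
        return None, None
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value

def diff_sorted_keyvalue(old_path, new_path, version_path):
    with open(old_path) as old, open(new_path) as new, \
         open(version_path, "w") as out:
        old_key, old_val = read_record(old)
        new_key, new_val = read_record(new)
        while old_key is not None or new_key is not None:
            if new_key is None or (old_key is not None and old_key < new_key):
                out.write(f"D\t{old_key}\n")                 # key vanished
                old_key, old_val = read_record(old)
            elif old_key is None or new_key < old_key:
                out.write(f"A\t{new_key}\t{new_val}\n")      # key appeared
                new_key, new_val = read_record(new)
            else:                                            # same key
                if old_val != new_val:
                    out.write(f"U\t{new_key}\t{new_val}\n")  # value changed
                old_key, old_val = read_record(old)
                new_key, new_val = read_record(new)
```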
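Finally, the chunking idea at the end might look something like the sketch below - splitting a sorted key-value dump into per-prefix files so that retrieval and diffing could be parallelized. This too is hypothetical; the two-character prefix is arbitrary, and it assumes key prefixes are safe to use as file names.

```python
# Hypothetical sketch: split a sorted key-value db into smaller files named
# by the first character or two of each key, for parallel processing.
# Because the input is sorted, each prefix appears as one contiguous run.
import os

def chunk_by_key_prefix(db_path, out_dir, prefix_len=2):
    os.makedirs(out_dir, exist_ok=True)
    current_prefix, out = None, None
    with open(db_path) as db:
        for line in db:
            key = line.split("\t", 1)[0]
            prefix = key[:prefix_len] or "_empty"
            if prefix != current_prefix:
                if out:
                    out.close()
                out = open(os.path.join(out_dir, prefix + ".kv"), "w")
                current_prefix = prefix
            out.write(line)
    if out:
        out.close()
```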
Does this go along with people's expectations?

Cheers,

Damion

________________________________________
From: Björn Grüning [bjoern.gruening@gmail.com]
Sent: Monday, September 01, 2014 12:47 PM
To: Dooley, Damion; Björn Grüning; galaxy-dev@lists.bx.psu.edu
Cc: Hsiao, William
Subject: Re: [galaxy-dev] Concept for a Galaxy Versioned Fasta Data Retrieval Tool

On 25.08.2014 at 18:05, Dooley, Damion wrote:
Ok, I'll be very happy to see what you've accomplished there. I will read through what you've done when I return from vacation in a week!
A key need is to have whatever data comes in show up as linked data in one's history, to avoid server overhead; a second objective was to not need to modify existing workflows, as long as they could work off data in the history that is typed appropriately. So your 'select type' solution sounds intriguing!
And I'm certainly interested in your use of git. I tried git with a 1-line fasta data format, but it seemed to choke on protein fasta files. Did it run into performance problems with larger files for you? That was my experience. I think I read its authors saying its upper limit was around 15 GB.
This is probably true for one large file. I've been storing the entire PDB in git for a few years now. One entry per file, and it works fine. Do you know git annex? https://git-annex.branchable.com/
That was the motivation for writing a simple key-value master file diff system that seems to have the same I/O as git on smaller files, but is more reliable for the fasta data case and has no problems with larger files - it outputs a new version in the same time it takes to read a master file. It has drawbacks, though: incoming data to compare against the master must first be sorted into 1-line fasta format.
My intention was to create a universal solution for database tracking. So if you can, please design your system in such a way that it can store arbitrary data, not only fasta files.