Well well, thanks very much for that reference! I can see how your system, which lets a workflow process delta (diff) data and merge the results back with a previous run's output, would greatly reduce the processing power needed to keep results current. Interesting choice of technologies too.

Damion

Message: 5
Date: Fri, 5 Sep 2014 10:41:37 +0000
From: Pedersen Edvard <edvard.pedersen@uit.no>
To: "galaxy-dev@lists.bx.psu.edu" <galaxy-dev@lists.bx.psu.edu>
Subject: Re: [galaxy-dev] Concept for a Galaxy Versioned Fasta Data Retrieval Tool

My PhD work may be of interest for this subject, although its primary focus has been on generating databases comprising the changes from a specific timeframe, and it was not designed specifically for Galaxy.

The similarities between my system and the one you are proposing are that it can generate a BLAST database from any date (that has been added to the system), as well as "diffs" between two dates. It supports FASTA, the UniProt EMBL variant, full files (which do not give compression benefits) and several others.

The system uses delta compression to make sure that non-updated fields do not take up extra space. It uses the Hadoop stack (HBase, HDFS and MapReduce) for parallelism in generating the databases (the BLAST database generation from FASTA files is not parallel).

You can find one of the publications here:
http://bdps.cs.uit.no/papers/hibb13.pdf

I hope this can be of some use to you.

Regards,
Edvard Pedersen
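
To make the idea in the message above concrete, here is a minimal Python sketch of a versioned FASTA store that keeps a new copy of a record only when it changes, and can rebuild a snapshot for any date or a diff between two dates. It is an illustration only, not the system described in the paper: the real system uses HBase, HDFS and MapReduce, whereas this uses an in-memory dict, and the names VersionedFastaStore, add_release, snapshot, diff and to_fasta are hypothetical.

```python
from datetime import date


class VersionedFastaStore:
    """Toy in-memory store; a record gets a new version only when it changes."""

    def __init__(self):
        # record id -> list of (release date, sequence or None), oldest first.
        # None marks a record that disappeared in that release.
        self._history = {}

    def add_release(self, when, records):
        """Register a full release (dict of id -> sequence) dated `when`.

        Assumption: releases are added in chronological order.
        """
        for rec_id, seq in records.items():
            versions = self._history.setdefault(rec_id, [])
            # Unchanged records take no extra space.
            if not versions or versions[-1][1] != seq:
                versions.append((when, seq))
        for rec_id, versions in self._history.items():
            if rec_id not in records and versions[-1][1] is not None:
                versions.append((when, None))

    def snapshot(self, when):
        """Return id -> sequence as it stood on date `when`."""
        result = {}
        for rec_id, versions in self._history.items():
            current = None
            for version_date, seq in versions:
                if version_date > when:
                    break
                current = seq
            if current is not None:
                result[rec_id] = current
        return result

    def diff(self, start, end):
        """Return what was added, changed or removed between two dates."""
        old, new = self.snapshot(start), self.snapshot(end)
        return {
            "added": {k: v for k, v in new.items() if k not in old},
            "changed": {k: v for k, v in new.items()
                        if k in old and old[k] != v},
            "removed": sorted(k for k in old if k not in new),
        }

    def stored_versions(self):
        """Count stored versions, to show the effect of skipping unchanged records."""
        return sum(len(v) for v in self._history.values())


def to_fasta(records):
    """Serialise id -> sequence as FASTA text, e.g. as input for makeblastdb."""
    return "".join(f">{rec_id}\n{seq}\n" for rec_id, seq in sorted(records.items()))
```

The output of to_fasta(store.snapshot(some_date)) or to_fasta(store.diff(d1, d2)["changed"]) could then be handed to a BLAST database build step; as noted in the message, that step is not the parallel part.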