Hello Curtis,

The datacache was originally pointed to the data staging area and is now pointed to the data published area. The difference is that the published area contains data and location (.loc) files that are in sync and have completed final testing. Whether to use the staged-only data is your choice: it depends on how risk-tolerant your project is and whether you plan on testing. That said, I think it is almost certainly fine, or our team wouldn't have staged it. A vanishingly small number of datasets are pulled back once they make it to staging, which is why we were comfortable pointing datacache there in the first place (we were unable to point to the published area at first, but wanted to make the data available ASAP).

Going forward, I can tell you that these indexes are very easy to create: one command-line execution, then add one line to the associated .loc file. Instructions are here, see "Bowtie and Tophat":
http://wiki.galaxyproject.org/Admin/NGS%20Local%20Setup
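
As a rough sketch of those two steps (not the exact procedure from the wiki), here is how the build command and the .loc entry could be put together in Python. It assumes bowtie2-build is on your PATH and that the bowtie2 .loc file uses four tab-separated columns (unique build id, dbkey, display name, index base path); please check the .loc.sample file shipped with your Galaxy instance to confirm. The paths and function names below are hypothetical.

```python
import shlex


def build_command(fasta: str, index_base: str) -> list[str]:
    """Return the index build invocation: bowtie2-build <reference.fa> <index_base>."""
    return ["bowtie2-build", fasta, index_base]


def loc_line(build_id: str, dbkey: str, display_name: str, index_base: str) -> str:
    """Return one tab-separated .loc entry.

    The column order is an assumption; compare it with your .loc.sample file.
    """
    return "\t".join([build_id, dbkey, display_name, index_base])


def append_loc_entry(loc_file: str, line: str) -> None:
    """Append a single entry to an existing .loc file."""
    with open(loc_file, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical locations; substitute your own genome and index paths.
    fasta = "/data/genomes/mm9/seq/mm9.fa"
    index_base = "/data/genomes/mm9/bowtie2_index/mm9"

    # Step 1: the one command-line execution (printed here, not run).
    # To actually run it: subprocess.run(build_command(...), check=True)
    print(shlex.join(build_command(fasta, index_base)))

    # Step 2: the one line to add to the associated .loc file.
    print(loc_line("mm9", "mm9", "Mouse (mm9)", index_base))
```

Once the index exists, append_loc_entry() can add the line to the .loc file. Galaxy usually needs a restart (or a data table reload) before the new entry shows up in the tool form.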

For one or a few genomes, this is not a problem. For hundreds of genomes with variants, it can become tedious even with helper tools, and in our case the processing interacted with disk that was undergoing changes (we have been working on system configuration most of the summer). Also, now that the Data Manager is available, creating batch indexes for use via rsync became a lower priority. Even so, I would expect more indexes to be fully published once the final configuration is in place, as many are already staged or close to being staged (watch the yellow banner on Main).

Hopefully this helps to explain the data, guides you to making an informed decision, and aids with creating your own indexes as needed.

Thanks!
Jen
Galaxy team
 
On 9/18/13 1:04 PM, Curtis Hendrickson (Campus) wrote:

Folks,

 

First, I wanted to thank you for making the datacache available (http://wiki.galaxyproject.org/Admin/Data%20Integration; rsync://datacache.g2.bx.psu.edu). It’s a great resource.

 

However, what is the best way to stay abreast of changes to what’s in datacache, and understand how these indexes are computed?

 

We are currently upgrading to bowtie2, but I notice that the bowtie2 indices for mm9, which used to be in

                rsync://datacache.g2.bx.psu.edu/indexes/mm9/mm9*/bowtie2_index

have been removed, and only the hg19 genome has bowtie2 indices. Why only that one, and not the others?

Where are the scripts you use to make these indices, in case I want to create bowtie2 indices for other

 

So, how do I find out *why* they were removed? (Can I safely use the copy I have, or was there a problem with them?)

 

More generally, how do I understand the policies and logic behind the datacache indices, and be notified of changes, short of running my own periodic rsync/diff?

 

Finally, since I’m doing “reproducible research”, is anything planned for systematically versioning genome indices, so I can easily tell what version of a system (i.e., what BWA version) was used to create the index, and be sure that an index will not suddenly disappear?

 


Thanks,

Curtis

Research Associate/CTSA-Informatics Team

University of Alabama at Birmingham


-- 
Jennifer Hillman-Jackson
http://galaxyproject.org