galaxy core services vs wrapper.py duplication & caching of .fai files
Folks,

I was writing a samtools/mpileup wrapper for our local use. When I delved into how the existing samtools/sam_pileup.py adaptor worked, I found that it has a local copy of the routine to look up the samtools .fai file in sam_fa_indices.loc for "installed" genomes. I then noticed that this routine is duplicated in many different adaptors:

[curtish@cheaha galaxy]$ find . -name "*.py" | xargs grep sam_fa_indices.loc
./tools/samtools/sam_pileup.py: seqFile = '%s/sam_fa_indices.loc' % GALAXY_DATA_INDEX_DIR
./tools/samtools/sam_mpileup_view.py: seqFile = '%s/sam_fa_indices.loc' % GALAXY_DATA_INDEX_DIR
./tools/samtools/sam_to_bam.py: cached_seqs_pointer_file = '%s/sam_fa_indices.loc' % options.index_dir
./tools/ngs_rna/cufflinks_wrapper_with_gtf.py: cached_seqs_pointer_file = os.path.join( options.index_dir, 'sam_fa_indices.loc' )
./tools/ngs_rna/cuffdiff_wrapper.py: cached_seqs_pointer_file = os.path.join( options.index_dir, 'sam_fa_indices.loc' )
./tools/ngs_rna/cufflinks_wrapper.py: cached_seqs_pointer_file = os.path.join( options.index_dir, 'sam_fa_indices.loc' )
./tools/ngs_rna/cuffcompare_wrapper.py: cached_seqs_pointer_file = os.path.join( options.index_dir, 'sam_fa_indices.loc' )
./tools/ngs_rna/cufflinks_wrapper_without_gtf.py: cached_seqs_pointer_file = os.path.join( options.index_dir, 'sam_fa_indices.loc' )

Is there any place in galaxy-core where such a core service lives and could be used by all these adaptors, rather than replicating the code everywhere?

As a related question, for fasta genomes from the current history, these wrappers compute the .fai file on the fly, in TMP, then throw it away, every time. Has there been any discussion about storing such derived indices in the dataset's metadata (like the .bai file on a .bam data set), so it gets computed once, then re-used?
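For illustration, the duplicated lookup could be factored into one small helper along these lines. This is only a sketch: the function names are hypothetical, and it assumes that sam_fa_indices.loc is tab-separated with the columns `index`, dbkey, and FASTA path, and that the .fai sits next to the FASTA under the usual samtools naming (`<fasta>.fai`).

```python
import os


def find_fasta_for_dbkey(loc_path, dbkey):
    """Return the FASTA path registered for dbkey in a .loc file, or None."""
    with open(loc_path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            # Skip blank lines and comment lines.
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            # Assumed layout: index <TAB> dbkey <TAB> path
            if len(fields) >= 3 and fields[0] == "index" and fields[1] == dbkey:
                return fields[2].strip()
    return None


def find_fai_for_dbkey(index_dir, dbkey):
    """Return the expected .fai path for an installed genome, or None."""
    loc_path = os.path.join(index_dir, "sam_fa_indices.loc")
    fasta = find_fasta_for_dbkey(loc_path, dbkey)
    # samtools faidx writes its index as <fasta>.fai by default.
    return None if fasta is None else fasta + ".fai"
```

Each wrapper would then call `find_fai_for_dbkey( index_dir, dbkey )` instead of carrying its own copy of the parsing loop.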
Curtis,
[curtish@cheaha galaxy]$ find . -name "*.py" | xargs grep sam_fa_indices.loc
./tools/samtools/sam_pileup.py: seqFile = '%s/sam_fa_indices.loc' % GALAXY_DATA_INDEX_DIR
...
./tools/ngs_rna/cufflinks_wrapper_without_gtf.py: cached_seqs_pointer_file = os.path.join( options.index_dir, 'sam_fa_indices.loc' )
Is there any place in galaxy-core where such a core service lives and could be used by all these adaptors, rather than replicating the code everywhere?
Not yet, but this is definitely needed. However, tools and Galaxy must remain independent, so the location of any needed indices should be passed to the tool on the command line rather than having tools call into Galaxy.
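On the tool side, that could look roughly like the sketch below: the wrapper takes the reference and its index as plain arguments and never reads a .loc file or imports anything from Galaxy. The option names (`--input`, `--ref-fasta`, `--ref-index`) are made up for the example and are not existing wrapper flags.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line of a hypothetical pileup wrapper."""
    parser = argparse.ArgumentParser(description="Hypothetical pileup wrapper")
    parser.add_argument("--input", required=True, help="BAM file to process")
    parser.add_argument("--ref-fasta", required=True,
                        help="reference FASTA, resolved by the caller")
    parser.add_argument("--ref-index", default=None,
                        help="path to the .fai; defaults to <ref-fasta>.fai")
    args = parser.parse_args(argv)
    # Fall back to the usual samtools naming when no index path is given.
    if args.ref_index is None:
        args.ref_index = args.ref_fasta + ".fai"
    return args


if __name__ == "__main__":
    options = parse_args()
    print("reference index: %s" % options.ref_index)
```

Whatever resolves the dbkey to a path then lives in one place on the Galaxy side, and the tool works equally well when run by hand outside Galaxy.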
As a related question, for fasta genomes from the current history, these wrappers compute the .fai file on the fly, in TMP, then throw it away, every time. Has there been any discussion about storing such derived indices in the dataset’s metadata (like the .bai file on a .bam data set), so it gets computed once, then re-used?
Converted datasets, which subsume indices-as-metadata, can store dataset indices. Extending converted datasets to store indices created on the fly is also very much needed. Any community contributions that address these issues would be most welcome.

Best,
J.
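To make the "compute once, then re-use" idea concrete, here is a minimal standalone sketch. It is not how Galaxy's converted datasets work: it simply keys a cache directory on a hash of the FASTA contents. The function names and the cache layout are assumptions for the example; the only external call is `samtools faidx <file>`, which writes `<file>.fai` next to its input.

```python
import hashlib
import os
import shutil
import subprocess
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def build_fai_with_samtools(fasta_path, fai_path):
    """Index a symlink to the FASTA in a scratch dir, then keep the .fai."""
    scratch = tempfile.mkdtemp()
    try:
        link = os.path.join(scratch, "ref.fa")
        os.symlink(os.path.abspath(fasta_path), link)
        # samtools faidx writes ref.fa.fai next to the file it is given.
        subprocess.check_call(["samtools", "faidx", link])
        shutil.move(link + ".fai", fai_path)
    finally:
        shutil.rmtree(scratch, ignore_errors=True)


def cached_fai(fasta_path, cache_dir, build=build_fai_with_samtools):
    """Return a cached .fai for fasta_path, building it only if missing."""
    os.makedirs(cache_dir, exist_ok=True)
    fai_path = os.path.join(cache_dir, file_digest(fasta_path) + ".fai")
    if not os.path.exists(fai_path):
        build(fasta_path, fai_path)
    return fai_path
```

Hashing the file is only one way to key the cache. Inside Galaxy, a dataset id would be the more natural key and would avoid rereading the file on every call.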
participants (2)
- Jeremy Goecks
- Robert Curtis Hendrickson