Hi JJ and mailing list,

I think it was John who mailed earlier about implementing an SQLite datatype for proteomics. I sort of like this, since I find myself using SQLite all the time in my own command-line tool for filtering, protein grouping, and relating quant output to ID data: https://github.com/glormph/msstitch. I basically wrote this to combine percolator, mzIdentML, and OpenMS intermediate formats. It's currently a bit of a mixture of different databases and tables for different operations, but maybe it would be a good idea to have one large proteomics-experiment SQLite database. Note that this is nothing I want to push onto people as if it were the next PSI standard; it's more of a very convenient and fast intermediate datatype.

I believe that the Proteome Discoverer .msf files are also based on SQLite. While I'm not sure it's a good fit for a server environment, since it can run into locking/access problems when accessed by too many processes, it's a very nice fit for a bioinformatics CLI tool. Are you guys currently working on any standardized SQLite thing?

cheers,
Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
SciLifeLab Stockholm, Sweden
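[Editorial sketch: the kind of query Jorrit describes, relating quant output to ID data in SQLite, can be illustrated with two hypothetical tables. These table and column names are invented for illustration and are not msstitch's actual schema.]

```python
import sqlite3

# Hypothetical minimal schema: one table of PSM identifications,
# one table of per-channel quant values keyed by spectrum.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE psm   (spectrum_id TEXT PRIMARY KEY, sequence TEXT, qvalue REAL);
CREATE TABLE quant (spectrum_id TEXT, channel TEXT, intensity REAL);
""")
conn.executemany("INSERT INTO psm VALUES (?,?,?)",
                 [("s1", "PEPTIDEK", 0.001), ("s2", "SAMPLER", 0.2)])
conn.executemany("INSERT INTO quant VALUES (?,?,?)",
                 [("s1", "126", 1500.0), ("s1", "127", 1320.0)])

# Filter PSMs on q-value and attach the quant channels to each survivor,
# i.e. "relating quant output to ID data" in one join.
rows = conn.execute("""
    SELECT p.sequence, q.channel, q.intensity
    FROM psm p JOIN quant q USING (spectrum_id)
    WHERE p.qvalue < 0.01
""").fetchall()
```

Here only the PSM passing the q-value cutoff survives, carrying its two quant channels with it.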
On 2/12/15 1:50 AM, Jorrit Boekel wrote:
Jorrit,

We would like to define a Galaxy datatype that subclasses the Galaxy SQLite datatype and defines a schema for proteomics. Our current use case is as a Galaxy data provider for a visualization plugin. Ira and I discussed this last summer, and he quickly put together a demo visualization plugin. MSI is currently prototyping a Galaxy visualization plugin using the lorikeet spectral viewer.

For feasibility testing, I just used the schema that fell out of the mzR Bioconductor package (details below), but we would now like to design a schema that would be responsive, yet also generic enough to cover a variety of needs. I'll take a look at what you've done for guidance.

Thanks,
JJ

I've been using this data as a test case for the visualization plugin:

http://proteome.sysbiol.cam.ac.uk/lgatto/RforProteomics/data/iTRAQ_Erwinia_1...
http://proteome.sysbiol.cam.ac.uk/lgatto/RforProteomics/data/iTRAQ_Erwinia_1...

I generated the SQLite DB using this R script:

## install.packages(c("/Users/jj/src/bio/prot/bioc/mzR_2.0.0.tar.gz"), lib = "/Library/Frameworks/R.framework/Resources/library/", repos = NULL, type = "source")
library(mzR)
library(msdata)
library(jsonlite)

# Read in scan data
z <- openMSfile("iTRAQ_Erwinia.mzML")

# Not used in any DB table as yet
zri <- runInfo(z)
zii <- instrumentInfo(z)

# Read in mzIdentML
x <- openIDfile("iTRAQ_Erwinia.mzid")
ix <- mzidInfo(x)
p <- psms(x)
m <- modifications(x)
s <- score(x)

# Need to get spectrumID, then parse out scan num
psm <- data.frame(scanNum = sub("^.*=(\\d+)$", "\\1", p[,1]), p)

## pl <- peaks(z, 6591)
## toJSON(p[1,])
## toJSON(header(z, 6591))
## toJSON(peaks(z, 6591))

scanl <- 1:runInfo(z)$scanCount
moz <- sapply(scanl, function(y) toJSON(peaks(z, y)[,1]))
intensity <- sapply(scanl, function(y) toJSON(peaks(z, y)[,2]))
pkdf <- data.frame(scanNum = scanl, moz = moz, intensity = intensity)

## mzR to RSQLite
library("RSQLite")
sqlite <- dbDriver("SQLite")
dbconn <- dbConnect(sqlite, "iTRAQ_Erwinia.sqlite")
dbWriteTable(dbconn, "psm", psm)
dbWriteTable(dbconn, "scan", header(z))
dbWriteTable(dbconn, "score", s)
dbWriteTable(dbconn, "modification", m)
dbWriteTable(dbconn, "peaks", pkdf)

$ sqlite3 iTRAQ_Erwinia.sqlite
SQLite version 3.4.0
Enter ".help" for instructions
sqlite> .schema
CREATE TABLE modification (
  "spectrumID" TEXT,
  "sequence" TEXT,
  "name" TEXT,
  "mass" REAL,
  "location" INTEGER
);
CREATE TABLE peaks (
  "scanNum" INTEGER,
  "moz" TEXT,
  "intensity" TEXT
);
CREATE TABLE psm (
  "scanNum" TEXT,
  "spectrumID" TEXT,
  "chargeState" INTEGER,
  "rank" INTEGER,
  "passThreshold" INTEGER,
  "experimentalMassToCharge" REAL,
  "calculatedMassToCharge" REAL,
  "sequence" TEXT,
  "modNum" INTEGER,
  "isDecoy" INTEGER,
  "post" TEXT,
  "pre" TEXT,
  "start" INTEGER,
  "end" INTEGER,
  "DatabaseAccess" TEXT,
  "DatabaseSeq" TEXT,
  "DatabaseDescription" TEXT
);
CREATE TABLE scan (
  "seqNum" INTEGER,
  "acquisitionNum" INTEGER,
  "msLevel" INTEGER,
  "polarity" INTEGER,
  "peaksCount" INTEGER,
  "totIonCurrent" REAL,
  "retentionTime" REAL,
  "basePeakMZ" REAL,
  "basePeakIntensity" REAL,
  "collisionEnergy" REAL,
  "ionisationEnergy" REAL,
  "lowMZ" REAL,
  "highMZ" REAL,
  "precursorScanNum" INTEGER,
  "precursorMZ" REAL,
  "precursorCharge" INTEGER,
  "precursorIntensity" REAL,
  "mergedScan" INTEGER,
  "mergedResultScanNum" INTEGER,
  "mergedResultStartScanNum" INTEGER,
  "mergedResultEndScanNum" INTEGER
);
CREATE TABLE score (
  "spectrumID" TEXT,
  "MS_GF_RawScore" REAL,
  "MS_GF_DeNovoScore" REAL,
  "MS_GF_SpecEValue" REAL,
  "MS_GF_EValue" REAL
);

--
James E. Johnson
Minnesota Supercomputing Institute
University of Minnesota
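[Editorial sketch: given the psm and peaks tables from the schema above, a spectral viewer such as lorikeet would need to look up a PSM and decode the JSON-encoded peak arrays. The sample values below are made up; the table and column names follow the .schema output.]

```python
import sqlite3
import json

# Recreate the two relevant tables from the mzR-derived schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE psm   ("scanNum" TEXT, "sequence" TEXT, "chargeState" INTEGER);
CREATE TABLE peaks ("scanNum" INTEGER, "moz" TEXT, "intensity" TEXT);
""")
# Peak arrays are stored as JSON text, as produced by toJSON() in the R script.
conn.execute("INSERT INTO psm VALUES ('6591', 'PEPTIDEK', 2)")
conn.execute("INSERT INTO peaks VALUES (6591, ?, ?)",
             (json.dumps([113.1, 114.1]), json.dumps([200.0, 350.5])))

# psm.scanNum is TEXT while peaks.scanNum is INTEGER, so cast to join.
seq, moz_json, int_json = conn.execute("""
    SELECT p."sequence", k."moz", k."intensity"
    FROM psm p JOIN peaks k ON k."scanNum" = CAST(p."scanNum" AS INTEGER)
    WHERE p."scanNum" = '6591'
""").fetchone()

moz = json.loads(moz_json)        # m/z values ready for the viewer
intensity = json.loads(int_json)  # matching intensities
```

The TEXT/INTEGER mismatch on scanNum is one of the things a designed schema could clean up, e.g. by storing scanNum with one type in both tables.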
participants (2)
- Jim Johnson
- Jorrit Boekel