On 2/12/15 1:50 AM, Jorrit Boekel wrote:
Hi JJ and mailing list,
I think it was John who mailed earlier about implementing an SQLite datatype for
proteomics. I sort of like this, since I find myself using SQLite all the time for my own
command line tool to do filtering, protein grouping, and relating quant output to ID
data.
https://github.com/glormph/msstitch — I basically wrote this to combine percolator,
mzidentml, and openMS intermediate formats. It’s currently a bit of a mixture of different
databases and tables for different operations, but maybe it would be a good idea to have
one large proteomics experiment SQLite database. Note that this is nothing I want to push
onto people like it would be the next PSI standard, more like a very convenient and fast
intermediate datatype.
I believe that the proteome discoverer msf files are also based on sqlite. While I’m not
sure if it’s a good fit for a server environment since it can get locking/access problems
when accessed by too many processes, it’s a very nice fit for a bioinformatic cli tool.
Are you guys currently working on any standardized SQLite thing?
cheers,
—
Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden
Jorrit,
We would like to define a Galaxy datatype that subclasses the Galaxy SQLite datatype and
defines a schema for proteomics.
Our current use case is as a Galaxy dataprovider for a visualization plugin.
Ira and I had discussed this last summer, and he quickly put together demo visualization
plugin.
MSI is currently prototyping a Galaxy visualization plugin using the lorikeet spectral
viewer.
For feasibility testing, I just used the schema that fell out of the mzR bioconductor
package (details below),
but we would now like to design a schema that would be responsive, but also generic enough
to cover a variety of needs.
I'll take a look at what you've done for guidance.
Thanks,
JJ
I've been using this data as a test case for the Visualization plugin:
http://proteome.sysbiol.cam.ac.uk/lgatto/RforProteomics/data/iTRAQ_Erwini...
http://proteome.sysbiol.cam.ac.uk/lgatto/RforProteomics/data/iTRAQ_Erwini...
I generated the SQLite DB using this R script:
## install.packages(c("/Users/jj/src/bio/prot/bioc/mzR_2.0.0.tar.gz"), lib =
"/Library/Frameworks/R.framework/Resources/library/",repos=NULL,type="source")
library(mzR)
library(msdata)
library(jsonlite)
# Read in scan data
z <- openMSfile("iTRAQ_Erwinia.mzML")
# Not used in any DB table as yet
zri <- runInfo(z)
zii <- instrumentInfo(z)
# Read in mzIdentML
x <- openIDfile("iTRAQ_Erwinia.mzid")
ix <- mzidInfo(x)
p <- psms(x)
m <- modifications(x)
s <- score(x)
psm <- data.frame(scanNum=sub("^.*=(\\d+)$","\\1",p[,1]),p)
# Need to get spectrumID
# Then parse out scan num
## pl <- peaks(z,6591)
## toJSON(p[1,])
## toJSON(header(z,6591))
## toJSON(peaks(z,6591))
scanl <- 1:runInfo(z)$scanCount
moz <- sapply(scanl,function(y) toJSON(peaks(z,y)[,1]))
intensity <- sapply(scanl,function(y) toJSON(peaks(z,y)[,2]))
pkdf <- data.frame(scanNum = scanl, moz = moz, intensity = intensity)
## mzR to RSQLite
library("RSQLite")
sqlite <- dbDriver("SQLite")
dbconn <- dbConnect(sqlite,"iTRAQ_Erwinia.sqlite")
dbWriteTable(dbconn,"psm",psm)
dbWriteTable(dbconn,"scan",header(z))
dbWriteTable(dbconn,"score",s)
dbWriteTable(dbconn,"modification",m)
dbWriteTable(dbconn,"peaks",pkdf)
$ sqlite3 iTRAQ_Erwinia.sqlite
SQLite version 3.4.0
Enter ".help" for instructions
sqlite> .schema
CREATE TABLE modification
( "spectrumID" TEXT,
"sequence" TEXT,
"name" TEXT,
"mass" REAL,
"location" INTEGER
);
CREATE TABLE peaks
( "scanNum" INTEGER,
"moz" TEXT,
"intensity" TEXT
);
CREATE TABLE psm
( "scanNum" TEXT,
"spectrumID" TEXT,
"chargeState" INTEGER,
"rank" INTEGER,
"passThreshold" INTEGER,
"experimentalMassToCharge" REAL,
"calculatedMassToCharge" REAL,
"sequence" TEXT,
"modNum" INTEGER,
"isDecoy" INTEGER,
"post" TEXT,
"pre" TEXT,
"start" INTEGER,
"end" INTEGER,
"DatabaseAccess" TEXT,
"DatabaseSeq" TEXT,
"DatabaseDescription" TEXT
);
CREATE TABLE scan
( "seqNum" INTEGER,
"acquisitionNum" INTEGER,
"msLevel" INTEGER,
"polarity" INTEGER,
"peaksCount" INTEGER,
"totIonCurrent" REAL,
"retentionTime" REAL,
"basePeakMZ" REAL,
"basePeakIntensity" REAL,
"collisionEnergy" REAL,
"ionisationEnergy" REAL,
"lowMZ" REAL,
"highMZ" REAL,
"precursorScanNum" INTEGER,
"precursorMZ" REAL,
"precursorCharge" INTEGER,
"precursorIntensity" REAL,
"mergedScan" INTEGER,
"mergedResultScanNum" INTEGER,
"mergedResultStartScanNum" INTEGER,
"mergedResultEndScanNum" INTEGER
);
CREATE TABLE score
( "spectrumID" TEXT,
"MS_GF_RawScore" REAL,
"MS_GF_DeNovoScore" REAL,
"MS_GF_SpecEValue" REAL,
"MS_GF_EValue" REAL
);
--
James E. Johnson Minnesota Supercomputing Institute University of Minnesota