# HG changeset patch -- Bitbucket.org
# Project galaxy-dist
# URL http://bitbucket.org/galaxy/galaxy-dist/overview
# User Richard Burhans <burhans@bx.psu.edu>
# Date 1285871526 14400
# Node ID 51aaa9eac8182b25dbbffd1de902b78c916cb0f5
# Parent ddf70ed04c8b24f9095d4b16f6041907f7c6d428

Updates to LPS tool help text.

--- a/tools/human_genome_variation/lps.xml
+++ b/tools/human_genome_variation/lps.xml
@@ -120,7 +120,7 @@
   <param name="c1" type="float" value="1e-3" help="Parameter defining the margin by which the first-order step is required to decrease before being taken."><validator type="in_range" message="0.0 < c1 < 1.0" min="0.0" max="1.0"/></param>
-  <param name="maxIter" type="integer" value="10000" label="Maximum number of iterations"/>
+  <param name="maxIter" type="integer" value="10000" label="Maximum number of iterations" help="Terminate with an error if this limit is exceeded."/>
   <param name="stopTol" type="float" value="1e-6" label="Stop tolerance" help="Convergence tolerance for target value of lambda."/>
   <param name="intermediateTol" type="float" value="1e-4" label="Intermediate tolerance" help="Convergence tolerance for intermediate values of lambda."/>
   <param name="finalOnly" type="select" format="integer" label="Final only">
@@ -132,8 +132,8 @@
 </inputs>
 <outputs>
-  <data name="output_file" format="tabular"/>
-  <data name="log_file" format="txt"/>
+  <data name="output_file" format="tabular" label="${tool.name} on ${on_string}: results"/>
+  <data name="log_file" format="txt" label="${tool.name} on ${on_string}: log"/>
 </outputs>
 <requirements>
@@ -188,28 +188,52 @@ There is a second output dataset (a log)
 **What it does**

-The LASSO-Patternsearch algorithm efficiently identifies patterns of multiple
-dichotomous risk factors for outcomes of interest in demographic and genomic
-studies. It is designed for the case where there is a possibly very large
-number of candidate patterns but it is believed that only a relatively small
-number are important.
+The LASSO-Patternsearch algorithm fits an L1-regularized logistic
+regression model to your dataset. A benefit of L1-regularization
+is that it typically yields a weight vector with relatively few
+non-zero coefficients.

-If the "risky" direction (with respect to the outcome of interest) is known
-for all or almost all variables, the results are readily interpretable.
-If the risky direction is coded correctly for all of the variables, the
-fitted model can be expected to be sparser than that for any other coding.
-However, if a small number of risky variables are coded in the "wrong" way,
-this usually can be detected.
+For example, say you have a dataset containing M rows (subjects)
+and N columns (attributes) where one of these N attributes is binary,
+indicating whether or not the subject has some property of interest P.
+In simple terms, LPS calculates a weight for each of the other attributes
+in your dataset. This weight indicates how "relevant" that attribute
+is for predicting whether or not a given subject has property P.
+The L1-regularization causes most of these weights to be equal to zero,
+which means LPS will find a "small" subset of the remaining N-1 attributes
+in your dataset that can be used to predict P.

-The input file is tabular with rows representing individuals and columns
-representing variables. There is one special column, the label column,
-containing +1 for cases, and -1 for controls. The other columns should be
-0 or 1, with 1 representing the expected riskier value for each variable.
-For instance with SNPs the column would have a 1 if the individual (row)
-has the risk allele, or a 0 otherwise. The output file has one line for each
-variable, or "feature" in the input file, with a single column containing the
-calculated score for that feature. The log file provides information about
-the input and the internal values obtained during the computation process.
+In other words, LPS can be used for feature selection.
+
+The input dataset is tabular, and must contain a label column that
+indicates whether or not a given row has property P. In the current
+version of this tool, P must be encoded using +1 and -1. The Lambda_fac
+parameter ranges from 0 to 1, and controls how sparse the weight
+vector will be. At the low end, when Lambda_fac = 0, there is
+no regularization. At the high end, when Lambda_fac = 1, there is
+"too much" regularization, and all of the weights will equal zero.
+
+The LPS tool creates two output datasets. The first, called the results
+file, is a tabular dataset containing one column of weights for each
+value of the regularization parameter lambda that was tried. The weight
+columns are ordered from left to right by decreasing values of lambda.
+The first N-1 rows in each column are the weights for the N-1 attributes
+in your input dataset. The final row is a constant, the intercept.
+
+Let **x** be a row from your input dataset and let **b** be a column
+from the results file. To compute the probability that row **x** has
+a label value of +1:
+
+  Probability(row **x** has label value = +1) = 1 / [1 + exp{**x** \* **b**\[1..N-1\] + **b**\[N\]}]
+
+where **x** \* **b**\[1..N-1\] is the dot product of **x** (label column excluded) with the first N-1 entries of **b**.
+
+The second output dataset, called the log file, is a text file that
+contains additional data about the fitted L1-regularized logistic
+regression model. These data include the number of features, the
+computed value of lambda_max, the actual values of lambda used, the
+optimal values of the log-likelihood and regularized log-likelihood
+functions, the number of non-zeros, and the number of iterations.

 Website: http://pages.cs.wisc.edu/~swright/LPS/

@@ -235,11 +259,16 @@ Website: http://pages.cs.wisc.edu/~swrig
 - output log file::

-    Data set has 100 vectors with 50 features
-    calculateLambdaMax: n=50, m=100, m+=50, m-=50
-    computed value of lambda_max: 5.0000e-01
-    lambda=2.50e-02 solution has 10 nonzeros.
-    It required 546 iterations
+    Data set has 100 vectors with 50 features.
+    calculateLambdaMax: n=50, m=100, m+=50, m-=50
+    computed value of lambda_max: 5.0000e-01
+
+    lambda=2.96e-02 solution:
+        optimal log-likelihood function value: 6.46e-01
+        optimal *regularized* log-likelihood function value: 6.79e-01
+        number of nonzeros at the optimum: 5
+        number of iterations required: 43
+    etc.

 -----
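
As a worked illustration of the probability formula in the new help text
above (this sketch is not part of the tool or of this patch), the following
Python applies one weight column **b** from the results file to one input
row. The function name and the literal weights are made up for the example;
the sign inside exp() copies the help text's convention verbatim::

    import math

    def prob_label_plus_one(attrs, b):
        # attrs: the N-1 attribute values for one row (label column excluded)
        # b: one weight column from the results file; b[:-1] holds the N-1
        #    attribute weights and b[-1] is the intercept (the final row)
        z = sum(x * w for x, w in zip(attrs, b[:-1])) + b[-1]
        return 1.0 / (1.0 + math.exp(z))  # sign follows the help text above

    # toy usage: 3 attribute values, a 4-row weight column (3 weights + intercept)
    print(prob_label_plus_one([1, 0, 1], [0.5, -0.2, 0.0, 0.1]))
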
--- a/test-data/lps_arrhythmia_log.txt
+++ b/test-data/lps_arrhythmia_log.txt
@@ -1,5 +1,4 @@
-Data set has 452 vectors with 279 features
-
+Data set has 452 vectors with 279 features.
 Sampled 452 points out of 452
 calculateLambdaMax: n=279, m=452, m+=245, m-=207
 computed value of lambda_max: 1.8231e+02
@@ -52,36 +51,147 @@ iter 1, gpnorm=1.7618e-09, nonzero=
 **** Initial point: nz=1, f= 0.689609056404, lambda= 5.469e+00
 iter 1, gpnorm=1.7618e-09, nonzero= 1 ( 0.4%), function=6.896090564044e-01, alpha=3.2768e-01
 Function evals = 2, Gradient evals = 1.0
-lambda=1.64e+02 solution has 1 nonzeros.
-It required 6 iterations
-lambda=1.17e+02 solution has 1 nonzeros.
-It required 1 iterations +lambda=1.64e+02 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 6 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. + 0 in -1 with 50/50 chance. -lambda=8.31e+01 solution has 1 nonzeros. -It required 1 iterations +lambda=1.17e+02 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 1 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. + 0 in -1 with 50/50 chance. -lambda=5.91e+01 solution has 1 nonzeros. -It required 1 iterations +lambda=8.31e+01 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 1 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. + 0 in -1 with 50/50 chance. -lambda=4.21e+01 solution has 1 nonzeros. -It required 1 iterations +lambda=5.91e+01 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 1 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. + 0 in -1 with 50/50 chance. -lambda=3.00e+01 solution has 1 nonzeros. -It required 1 iterations +lambda=4.21e+01 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 1 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. + 0 in -1 with 50/50 chance. -lambda=2.13e+01 solution has 1 nonzeros. -It required 1 iterations +lambda=3.00e+01 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 1 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. + 0 in -1 with 50/50 chance. -lambda=1.52e+01 solution has 1 nonzeros. -It required 1 iterations +lambda=2.13e+01 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 1 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. 
+ 0 in -1 with 50/50 chance. -lambda=1.08e+01 solution has 1 nonzeros. -It required 1 iterations +lambda=1.52e+01 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 1 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. + 0 in -1 with 50/50 chance. -lambda=7.68e+00 solution has 1 nonzeros. -It required 1 iterations +lambda=1.08e+01 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 1 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. + 0 in -1 with 50/50 chance. -lambda=5.47e+00 solution has 1 nonzeros. -It required 1 iterations +lambda=7.68e+00 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 1 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. + 0 in -1 with 50/50 chance. +lambda=5.47e+00 solution: + optimal log-likelihood function value: 6.90e-01 + optimal *regularized* log-likelihood function value: 6.90e-01 + number of non-zeros at the optimum: 1 + number of iterations required: 1 + prediction using this solution: + 54.20% of vectors were correctly predicted. + 245 correctly predicted. + 207 in +1 predicted to be in -1. + 0 in -1 predicted to be in +1. + 0 in +1 with 50/50 chance. + 0 in -1 with 50/50 chance. +
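
The sparsity behaviour that the new help text attributes to Lambda_fac
(stronger regularization, fewer non-zero weights) can be reproduced with any
L1-penalized logistic regression. The sketch below uses scikit-learn as a
stand-in rather than the LASSO-Patternsearch code itself; scikit-learn's C
behaves like 1/lambda, so smaller C corresponds to stronger regularization::

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(100, 50)).astype(float)  # 100 subjects, 50 binary attributes
    y = np.where(X[:, 0] + X[:, 1] >= 1, 1, -1)           # label driven by only 2 attributes

    for C in (1.0, 0.1, 0.01):  # smaller C = stronger L1 penalty
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
        print("C=%.2f: %d non-zero weights out of %d"
              % (C, np.count_nonzero(clf.coef_), X.shape[1]))

As with Lambda_fac = 1 in the tool, a sufficiently strong penalty drives
every weight to zero, while a weak penalty leaves many non-zeros.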