{
"metadata": {
"name": "WiP: The Earth Is Round -- Significance Testing"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"WiP: The Earth Is Round -- Significance Testing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"After performing a classification analysis one is usually interested in\nan evaluation of the results with respect to their statistical uncertainty.\nIn the following we will take a look at a few possible approaches to get this from\nPyMVPA.\n\n",
"Let's look at a typical setup for a cross-validated classification. We start\nby generating a dataset with 200 samples and 3 features of which only two carry\nsome relevant signal. Afterwards we set up a standard leave-one-chunk-out\ncross-validation procedure for an SVM classifier. At this point we have seen\nthis numerous times, and the code should be easy to read:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from mvpa2.suite import *\n",
"ds = normal_feature_dataset(perlabel=100, nlabels=2, nfeatures=3,\n nonbogus_features=[0,1], snr=0.3, nchunks=2)\n",
"clf = LinearCSVMC()\n",
"partitioner = NFoldPartitioner()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cv = CrossValidation(clf,\n partitioner,\n errorfx=mean_match_accuracy,\n postproc=mean_sample())\n",
"acc = cv(ds)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"Take a look at the performance statistics of the classifier. Explore how it\nchanges with different values of the signal-to-noise (`snr`) parameter\nof the dataset generator function."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n",
"\n\n",
"The simplest way to get a quick assessment of the statistical uncertainty of\nthe classification accuracy is to look at the standard deviation of the\naccuracies across cross-validation folds. This can be achieved by removing\nthe `postproc` argument of `CrossValidation`.\n\n",
"Another, slightly more informative, approach is to compute confidence intervals\nfor the classification accuracy. We can do this by treating each prediction\nof the classifier as a Bernoulli trial with some success probability.\nIf we further assume statistical independence of these prediction outcomes\nwe can compute ",
"[binomial proportion confidence intervals](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval) using a variety\nof methods. To implement this calculation we only have to modify the\nerror function and the post processing of our previous analysis setup."
]
},
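{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside before the confidence intervals, here is a minimal sketch of\nthe first approach: dropping the `postproc` argument makes the cross-validation\nreturn one accuracy per fold, whose spread we can inspect directly (reusing the\n`clf`, `partitioner`, and `ds` objects from above)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cv_folds = CrossValidation(clf, partitioner, errorfx=mean_match_accuracy)\n",
"fold_accs = cv_folds(ds)\n",
"print fold_accs.samples.mean(), fold_accs.samples.std()"
],
"language": "python",
"metadata": {},
"outputs": []
},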
{
"cell_type": "code",
"collapsed": false,
"input": [
"cv = CrossValidation(\n clf,\n partitioner,\n errorfx=prediction_target_matches,\n postproc=BinomialProportionCI(width=.95, meth='jeffreys'))\n",
"ci_result = cv(ds)\n",
"ci = ci_result.samples[:, 0]\n",
"ci[0] < np.asscalar(acc) < ci[1]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Instead of computing accuracies we use an error function that returns a boolean\nvector of prediction success for each sample. In the post processing this\ninformation is then used to compute the confidence intervals. We can see that\nthe previously computed accuracy lies within the confidence interval. If the\nassumption of statistical independence of the classifier prediction success\nholds we can be 95% certain that the true accuracy is within this interval."
]
},
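{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition about what `BinomialProportionCI` computes: the Jeffreys interval\ntreats k successes out of n trials as arising from a Beta(k + 1/2, n - k + 1/2)\nposterior and takes its quantiles. A minimal SciPy sketch with made-up counts\n(not values from our dataset):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import scipy.stats\n",
"k, n = 80, 100  # hypothetical: 80 correct predictions out of 100\n",
"ci_lo, ci_hi = scipy.stats.beta.ppf([0.025, 0.975], k + 0.5, n - k + 0.5)\n",
"print ci_lo, ci_hi"
],
"language": "python",
"metadata": {},
"outputs": []
},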
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"Think about situations in which we cannot reasonably assume statistical\nindependence of classifier prediction outcomes. Hint: What if the data\nin the testing dataset shows strong auto-correlation?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"*Null* hypothesis testing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Another way of making statements like ",
"*\"Performance is significantly above\nchance-level\"* is ",
"*Null* hypothesis (aka ",
"*H0*) testing that PyMVPA supports for\nany ",
"[Measure](http://pymvpa.org/generated/mvpa2.measures.base.Measure.html#mvpa2-measures-base-measure).\n\n",
"However, as with other applications of statistics in classifier-based analyses,\nthere is the problem that we typically do not know the distribution of a\nvariable like error or performance under the ",
"*Null* hypothesis (i.e. the\nprobability of a result given that there is no signal), hence we cannot easily\nassign the adored p-values. Even worse, the chance-level or guess probability\nof a classifier depends on the content of a validation dataset, e.g. balanced\nor unbalanced number of samples per label, total number of labels, as well as\nthe peculiarities of \"independence\" of training and testing data -- especially\nin the neuroimaging domain."
]
},
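{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the chance-level point with a toy example in plain NumPy\n(hypothetical unbalanced targets, no classifier involved): always guessing the\nmajority class already yields an accuracy well above 0.5, so 0.5 would be the\nwrong baseline here."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"targets = np.repeat(['face', 'house'], (70, 30))  # hypothetical unbalanced targets\n",
"majority_guess = np.repeat('face', len(targets))\n",
"print np.mean(majority_guess == targets)"
],
"language": "python",
"metadata": {},
"outputs": []
},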
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Monte Carlo -- here I come!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"One approach to deal with this situation is to ",
"*estimate* the ",
"*Null*\ndistribution using permutation testing. The ",
"*Null* distribution is\nestimated by computing the measure of interest multiple times using the original\ndata samples but with permuted targets, presumably scrambling or destroying the\nsignal of interest. Since quite often the exploration of all permutations is\ninfeasible, Monte-Carlo testing (see ",
"*Nichols et al. (2002)*)\nallows us to obtain a stable estimate with only a limited number of random\npermutations.\n\n",
"Given the results computed using permuted targets, we can now determine the\nprobability of the empirical result (i.e. the one computed from the original\ntraining dataset) under the ",
"*no signal* condition. This is simply the fraction\nof results from the permutation runs that are larger or smaller than the\nempirical one (depending on whether one is looking at performances or errors).\n\n",
"Here is our previous cross-validation set up:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cv = CrossValidation(clf,\n partitioner,\n postproc=mean_sample(),\n enable_ca=['stats'])\n",
"err = cv(ds)"
],
"language": "python",
"metadata": {},
"outputs": []
},
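{
"cell_type": "markdown",
"metadata": {},
"source": [
"The core permutation-test logic itself is tiny. As a plain-NumPy aside (toy\ndata and a difference-of-means statistic, independent of the PyMVPA pipeline):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rng = np.random.RandomState(0)\n",
"a = rng.normal(0.5, 1, 50)  # toy group carrying signal\n",
"b = rng.normal(0.0, 1, 50)  # toy group without signal\n",
"observed = a.mean() - b.mean()\n",
"pooled = np.concatenate((a, b))\n",
"null_stats = []\n",
"for i in range(200):\n    shuffled = rng.permutation(pooled)\n    null_stats.append(shuffled[:50].mean() - shuffled[50:].mean())\n",
"# p-value: fraction of permuted statistics at least as extreme as observed\n",
"print np.mean(np.array(null_stats) >= observed)"
],
"language": "python",
"metadata": {},
"outputs": []
},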
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Now we want to run this analysis again, repeatedly and with a fresh\npermutation of the targets for each run. We need two pieces for the Monte\nCarlo shuffling. The first is an instance of an\n",
"[AttributePermutator](http://pymvpa.org/generated/mvpa2.generators.permutation.AttributePermutator.html#mvpa2-generators-permutation-attributepermutator) that will\npermute the target attribute of the dataset for each iteration. We\nwill instruct it to perform 200 permutations. In a real analysis, the\nnumber of permutations will often be more than that."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"permutator = AttributePermutator('targets', count=200)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"The `permutator` is a generator. Try generating all 200 permuted\ndatasets."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n",
"\n\n",
"The second necessary component for a Monte-Carlo-style estimation of the ",
"*Null*\ndistribution is the actual \"estimator\". ",
"[MCNullDist](http://pymvpa.org/generated/mvpa2.clfs.stats.MCNullDist.html#mvpa2-clfs-stats-mcnulldist)\nwill use the already created `permutator` to shuffle the targets and later on\nreport the p-value from the left tail of the ",
"*Null* distribution, because we are\ngoing to compute errors and we are interested in them being ",
"*lower* than chance.\nFinally, we also ask for all results from Monte-Carlo shuffling to be stored for\nsubsequent visualization of the distribution."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"distr_est = MCNullDist(permutator, tail='left', enable_ca=['dist_samples'])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"The rest is easy. Measures take an optional constructor argument `null_dist`\nthat can be used to provide an instance of some\n",
"[NullDist](http://pymvpa.org/generated/mvpa2.clfs.stats.NullDist.html#mvpa2-clfs-stats-nulldist) estimator -- and we have just created one!\nBecause a cross-validation is nothing but a measure, we can assign it our ",
"*Null*\ndistribution estimator, and it will also perform permutation testing, in\naddition to the regular classification analysis on the \"real\" dataset.\nConsequently, the code hasn't changed much:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cv_mc = CrossValidation(clf,\n partitioner,\n postproc=mean_sample(),\n null_dist=distr_est,\n enable_ca=['stats'])\n",
"err = cv_mc(ds)\n",
"cv.ca.stats.stats['ACC'] == cv_mc.ca.stats.stats['ACC']"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Other than it taking a bit longer to compute, the performance did not change.\nBut the additional waiting wasn't in vain, as we get the results of the\nstatistical evaluation. The `cv_mc` ",
"[conditional attribute](http://pymvpa.org/glossary.html#term-conditional-attribute)\n`null_prob` has a dataset that contains the p-values representing the\nlikelihood of an empirical value (i.e. the result from analysing the original\ndataset) being equal to or lower than one under the ",
"*Null* hypothesis, i.e. no\nactual relevant signal in the data. Or in more concrete terms, the p-value\nis the fraction of permutation results less than or\nequal to the empirical result."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"p = cv_mc.ca.null_prob\n",
"p.shape"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.asscalar(p) < 0.1"
],
"language": "python",
"metadata": {},
"outputs": []
},
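{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because we enabled the `dist_samples` conditional attribute, the reported\np-value can also be recomputed by hand as the fraction of *Null* results at or\nbelow the empirical error. Note that the exact value may differ marginally from\nthe reported one, depending on how ties and the empirical sample itself are\ncounted:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"null_samples = np.ravel(cv_mc.null_dist.ca.dist_samples)\n",
"manual_p = np.mean(null_samples <= np.asscalar(err))\n",
"print manual_p"
],
"language": "python",
"metadata": {},
"outputs": []
},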
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"How many cross-validation analyses were computed when running `cv_mc`?\nMake sure you are not surprised that it is more than 200.\nWhat is the minimum p-value that we can get from 200 permutations?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n",
"\n\n",
"Let's practise our visualization skills a bit and create a quick plot to\nshow the ",
"*Null* distribution and how \"significant\" our\nempirical result is. And let's make a function for plotting to show off\nour Python-foo!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def make_null_dist_plot(dist_samples, empirical):\n pl.hist(dist_samples, bins=20, normed=True, alpha=0.8)\n pl.axvline(empirical, color='red')\n # chance-level for a binary classification with balanced samples\n pl.axvline(0.5, color='black', ls='--')\n # scale x-axis to full range of possible error values\n pl.xlim(0,1)\n pl.xlabel('Average cross-validated classification error')\n",
"_ = pl.figure()\n",
"make_null_dist_plot(np.ravel(cv_mc.null_dist.ca.dist_samples),\n np.asscalar(err))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"You can see that we have created a histogram of the \"distribution samples\" stored\nin the ",
"*Null* distribution (because we asked for it previously). We can also\nsee that the ",
"*Null* or chance distribution is centered around the expected\nchance-level and the empirical error value is in the far left tail, thus\nrelatively unlikely to be a ",
"*Null* result, hence the low-ish p-value.\n\n",
"This wasn't too bad, right? We could stop here. But there is this smell...."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"The p-value that we have just computed and the ",
"*Null* distribution we looked\nat are, unfortunately, ",
"**invalid** -- at least if we want to know how likely\nit is to obtain our ",
"**empirical** result under a no-signal condition. Can you\nfigure out why?\n\n",
"PS: The answer is obviously in the next section, so do not spoil your learning\nexperience by reading it before you have thought about this issue!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Avoiding the trap OR Advanced magic 101"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Here is what went wrong: The dataset's class labels (aka targets) were shuffled\nrepeatedly, and for each iteration a full cross-validation of classification\nerror was computed. However, the shuffling was done on the ",
"*full* dataset,\nhence target values were permuted in both training ",
"*and* testing dataset\nportions in each CV-fold. This basically means that for each Monte Carlo\niteration the classifier was ",
"**tested** on new data/signal.\nHowever, we are actually interested in what the classifier has to say about the\n",
"*actual* data, but when it was ",
"**trained** on randomly permuted data.\n\n",
"Doing a whole-dataset permutation is a common mistake with very beneficial\nside-effects -- as you will see in a bit. Sadly, doing the permuting correctly (i.e.\nin the training portion of the dataset only) is a bit more complicated due to\nthe data-folding scheme that we have to deal with. Here is how it goes:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"repeater = Repeater(count=200)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"A `repeater` is a simple node that returns any given dataset a configurable\nnumber of times. We use this helper to configure the number of Monte Carlo\niterations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"A ",
"[Repeater](http://pymvpa.org/generated/mvpa2.generators.base.Repeater.html#mvpa2-generators-base-repeater) is also a generator. Try calling it\nwith our dataset. What does it do? How can you get it to produce the 200\ndatasets?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n",
"\n\n",
"The new `permutator` is again configured to shuffle the `targets`\nattribute. But this time only ",
"*once* and only for samples that were labeled as\nbeing part of the training set in a particular CV-fold. The `partitions`\nsample attribute is created by the NFoldPartitioner that we have already\nconfigured earlier (or any other partitioner in PyMVPA for that matter)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"permutator = AttributePermutator('targets',\n limit={'partitions': 1},\n count=1)"
],
"language": "python",
"metadata": {},
"outputs": []
},
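{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the `limit` argument in action, here is a quick sanity check (a sketch,\nassuming the usual generator interface): partition the dataset once, permute the\nresult, and verify that targets in the testing partition were left untouched."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"parted = partitioner.generate(ds).next()\n",
"permuted = permutator.generate(parted).next()\n",
"test_mask = parted.sa.partitions == 2\n",
"# targets in the testing partition must be identical before and after\n",
"print (permuted.targets[test_mask] == parted.targets[test_mask]).all()"
],
"language": "python",
"metadata": {},
"outputs": []
},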
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"The most significant difference is that we are now going to use a dedicated\nmeasure to estimate the ",
"*Null* distribution. That measure is very similar\nto the cross-validation we have used before, but differs in an important twist:\nwe use a chained generator to perform the data-folding. This chain consists\nof our typical partitioner (marks one chunk as testing data and the rest as\ntraining, for all chunks) and the new one-time permutator. This chain-generator\ncauses the cross-validation procedure to permute only the training data in each\ndata-fold and leave the testing data untouched. Note that we make the chain use\nthe `space` of the partitioner, to let the `CrossValidation` know which\nsamples attribute defines training and testing partitions."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"null_cv = CrossValidation(\n clf,\n ChainNode(\n [partitioner, permutator],\n space=partitioner.get_space()),\n postproc=mean_sample())"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"Create a separate chain-generator and explore what it does. Remember: it is\njust a generator."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n",
"\n\n",
"Now we create our new and improved distribution estimator. This looks similar\nto what we did before, but we now use our dedicated ",
"*Null* cross-validation\nmeasure, and run it as often as `repeater` is configured to estimate the\n",
"*Null* performance."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"distr_est = MCNullDist(repeater, tail='left',\n measure=null_cv,\n enable_ca=['dist_samples'])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"On the \"outside\" the cross-validation measure for computing the empirical\nperformance estimate is 100% identical to what we have used before. All the\nmagic happens inside the distribution estimator."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cv_mc_corr = CrossValidation(clf,\n partitioner,\n postproc=mean_sample(),\n null_dist=distr_est,\n enable_ca=['stats'])\n",
"err = cv_mc_corr(ds)\n",
"cv_mc_corr.ca.stats.stats['ACC'] == cv_mc.ca.stats.stats['ACC']"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cv_mc.ca.null_prob.samples < cv_mc_corr.ca.null_prob.samples"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"After running it we see that there is no change in the empirical performance\n(great!), but our significance did suffer (poor thing!). We can take a look\nat the whole picture by plotting our previous ",
"*Null* distribution estimate\nand the new, improved one as an overlay."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"make_null_dist_plot(cv_mc.null_dist.ca.dist_samples, np.asscalar(err))\n",
"make_null_dist_plot(cv_mc_corr.null_dist.ca.dist_samples, np.asscalar(err))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"It should be obvious that there is a substantial difference in the two\nestimates, but only the latter/wider distribution is valid!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"Keep it in mind. Keep it in mind. Keep it in mind."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"The following content is incomplete and experimental"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"If you have a clue"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"There are many ways to further tweak the statistical evaluation. For example, if\nthe family of the distribution is known (e.g. Gaussian/Normal) and provided via\nthe `dist_class` parameter of `MCNullDist`, then the permutation test samples\nwill be used to fit this particular distribution and estimate the distribution\nparameters. This can yield enormous speed-ups. Under the (strong) assumption\nof a Gaussian distribution, 20-30 permutations should be sufficient to get\nsensible estimates of the distribution parameters. Fitting a normal distribution\nwould look like this. Actually, only a single modification is necessary\n(the `dist_class` argument), but we will also reduce the number of permutations."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"distr_est = MCNullDist(Repeater(count=20),\n                       dist_class=scipy.stats.norm,\n                       tail='left',\n                       measure=null_cv,\n                       enable_ca=['dist_samples'])\n",
"cv_mc_norm = CrossValidation(clf,\n partitioner,\n postproc=mean_sample(),\n null_dist=distr_est,\n enable_ca=['stats'])\n",
"err = cv_mc_norm(ds)\n",
"distr = cv_mc_norm.null_dist.dists()[0]\n",
"make_null_dist_plot(cv_mc_norm.null_dist.ca.dist_samples,\n np.asscalar(err))\n",
"x = np.linspace(0,1,100)\n",
"_ = pl.plot(x, distr.pdf(x), color='black', lw=2)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Family-friendly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"When going through this chapter you might have thought: \"Jeez, why do they need\nto return a single p-value in a freaking dataset?\" But there is a good reason\nfor this. Let's set up another cross-validation procedure. This one is basically\nidentical to the last one, except that it does not average classifier performances\nacross data-folds (i.e. it omits `postproc=mean_sample()`)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cvf = CrossValidation(\n clf,\n partitioner,\n null_dist=MCNullDist(\n repeater,\n tail='left',\n measure=CrossValidation(\n clf,\n ChainNode([partitioner, permutator],\n space=partitioner.get_space()))\n )\n )"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"If we run this on our dataset, we no longer get a single performance value,\nbut one per data-fold (chunk) instead:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"err = cvf(ds)\n",
"len(err) == len(np.unique(ds.sa.chunks))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"But here comes the interesting bit:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(cvf.ca.null_prob) == len(err)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"So we get one p-value for each element in the dataset returned by the\ncross-validation run. More generally speaking, the distribution estimation\nhappens independently for each value returned by a measure -- be it\nmultiple samples, multiple features, or both. Consequently, it is possible\nto test a large variety of measures with this facility."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Evaluating multi-class classifications"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"So far we have mostly looked at the situation where a classifier is trying to\ndiscriminate data from two possible classes. In many cases we can assume that a\nclassifier that ",
"*cannot* discriminate these two classes would perform at a\nchance-level of 0.5 (ACC). If it does that we would conclude that there is no\nsignal of interest in the data, or our classifier of choice cannot pick it up.\nHowever, there is a whole universe of classification problems where it is not\nthat simple.\n\n",
"Let's revisit the classification problem from ",
"*the chapter on classifiers*."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from mvpa2.tutorial_suite import *\n",
"ds = get_haxby2001_data_alternative(roi='vt', grp_avg=False)\n",
"print ds.sa['targets'].unique"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"clf = kNN(k=1, dfx=one_minus_correlation, voting='majority')\n",
"cv = CrossValidation(clf, NFoldPartitioner(), errorfx=mean_mismatch_error,\n enable_ca=['stats'])\n",
"cv_results = cv(ds)\n",
"print '%.2f' % np.mean(cv_results)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"So here we have an 8-way classification problem, and during the cross-validation\nprocedure the chosen classifier makes correct predictions for approximately\nhalf of the data points. The big question is now: ",
"**What does that tell us?**\n\n",
"There are many scenarios that could lead to this prediction performance. It\ncould be that the fitted classifier model is very good, but only captures the\ndata variance for half of the data categories/classes. It could also be that\nthe classifier model quality is relatively poor and makes an equal number of\nerrors for all classes. In both cases the average accuracy will be around 50%,\nand most likely ",
"**highly significant**, given a chance performance of 1/8. We\ncould now spend some time testing this significance with expensive permutation\ntests, or making assumptions on the underlying distribution. However, that\nwould only give us a number telling us that the average accuracy is indeed\ndifferent from chance, but it doesn't address the problem that the accuracy\nalone doesn't tell us much about what we are interested in.\n\n",
"Interesting hypotheses in the context of this dataset could be whether the data\ncarry a signal that can be used to distinguish brain response patterns from\nanimate vs. inanimate stimulus categories, or whether data from object-like\nstimuli are all alike and can only be distinguished from random noise, etc. One\ncan imagine running such an analysis on data from different parts of the brain\nand the results changing -- without necessarily having a big impact on the\noverall classification accuracy.\n\n",
"A lot more interesting information is available from the confusion matrix, a\ncontingency table showing prediction targets vs. actual predictions."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print cv.ca.stats.matrix"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"We can see a strong diagonal, but also block-like structure, and have to\nrealize that simply staring at the matrix doesn't help us to easily assess the\nlikelihood of any of our hypotheses being true or false. It is trivial to do a\nChi-square test of the confusion table..."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Chi^2: %.3f (p=%.3f)' % cv.ca.stats.stats[\"CHI^2\"]"
],
"language": "python",
"metadata": {},
"outputs": []
},
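{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the same kind of test can be run with plain SciPy on a toy\nconfusion matrix (hypothetical counts, not our actual results):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import scipy.stats\n",
"toy_confusion = np.array([[40, 10],\n                          [12, 38]])  # hypothetical 2-class confusions\n",
"chi2, p, dof, expected = scipy.stats.chi2_contingency(toy_confusion)\n",
"print chi2, p"
],
"language": "python",
"metadata": {},
"outputs": []
},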
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"... but, again, it doesn't tell us anything other than that the classifier is\nnot just doing random guesses. It would be much more useful if we could\nestimate how likely it is, given the observed confusion matrix, that the\nemployed classifier is able to discriminate ",
"*all* stimulus classes from each\nother, and not just a subset. Even more useful would be if we could relate\nthis probability to specific alternative hypotheses, such as an\nanimate/inanimate-only distinction.\n\n",
"*Olivetti et al. (2012)* have devised a method that allows for\ndoing exactly that. The confusion matrix is analyzed in a Bayesian framework\nregarding the statistical dependency of observed and predicted class labels.\nConfusions within a set of classes that cannot be discriminated should be\nindependently distributed, while there should be a statistical dependency of\nconfusion patterns within any set of classes that can all be discriminated from\neach other.\n\n",
"This algorithm is available in the\n",
"[BayesConfusionHypothesis](http://pymvpa.org/generated/mvpa2.clfs.transerror.BayesConfusionHypothesis.html#mvpa2-clfs-transerror-bayesconfusionhypothesis) node."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cv = CrossValidation(clf, NFoldPartitioner(),\n errorfx=None,\n postproc=ChainNode((Confusion(labels=ds.UT),\n BayesConfusionHypothesis())))\n",
"cv_results = cv(ds)\n",
"print cv_results.fa.stat"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"The most likely hypothesis explaining this confusion matrix:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print cv_results.sa.hypothesis[np.argsort(cv_results.samples[:,1])[-1]]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Previously in part 8"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Previously, ",
"*while looking at classification*\nwe have observed that classification error depends on the chosen\nclassification method, data preprocessing, and how the error was obtained --\ntraining error vs generalization estimates using different data splitting\nstrategies. Moreover in ",
"*attempts to localize activity using searchlight* we saw that generalization error can reach\nrelatively small values even when processing random data which (should) have\nno true signal. So, the value of the error alone does not provide\nsufficient evidence to state that our classifier or any other method actually\nlearnt the mapping from the data into the variables of interest. How, then, do we\ndecide which estimate of the error can provide sufficient evidence that the\nconstructed mapping reflects the underlying phenomenon, or that our data\ncarried the signal of interest?\n\n",
"Researchers interested in developing statistical learning methods usually aim\nat achieving as high a generalization performance as possible. Newly published\nmethods often stipulate their advantage over existing ones by comparing their\ngeneralization performance on publicly available datasets with known\ncharacteristics (number of classes, independence of samples, actual presence\nof information of interest, etc.). Therefore, generalization performances\npresented in statistical learning publications are usually high enough to\nobliterate even a slight chance that they could have been obtained simply by\nchance. For example, classifiers trained on the ",
"[MNIST](http://yann.lecun.com/exdb/mnist) dataset of\nhandwritten digits were worth reporting whenever they demonstrated average\n",
"**errors of only 1-2%** while doing classification among samples of 10 different\ndigits (the largest error reported was 12% using the simplest classification\napproach).\n\n",
"The situation is substantially different in the domain of neural data\nanalysis. There, classification is most often used not to construct a reliable\nmapping from data into behavioral variable(s) with as small an error as possible,\nbut rather to show that the learnt mapping is good enough to claim that such a\nmapping exists and that the data carry the effects caused by the corresponding\nexperiment. Such an existence claim is conventionally verified with the\nclassical methodology of null-hypothesis (H0) significance testing (NHST),\nwhereby the achievement of generalization performance with a ",
"*statistically\nsignificant* excursion away from the ",
"*chance-level* is taken as proof that the\ndata carry the effects of interest.\n\n",
"The main conceptual problem with NHST is a widespread belief that, having observed\nthe data, the level of significance at which H0 could be rejected is equivalent to the\nprobability of H0 being true. I.e. if it is unlikely that the data come from\nH0, it is equally unlikely for H0 to be true. Such assumptions were shown to be\ngenerally wrong using ",
"*deductive and Bayesian reasoning* since\nP(D|H0) is not equal to P(H0|D) (unless P(D)==P(H0)). Moreover, ",
"*statistical\nsignificance* alone, taken without accompanying support on viability and\nreproducibility of a given finding, was argued ",
"*more likely to be*.\n\n",
"What differs multivariate analysis from univariate is that it",
"\n\n* ",
"avoids ",
"**multiple comparisons** problem in NHST",
"\n\n* ",
"has higher ",
"**flexibility**, thus lower ",
"**stability**",
"\n\n",
"Multivariate methods became very popular in the last decade of neuroimaging\nresearch partially due to their inherent ability to avoid multiple comparisons\nissue, which is a flagman of difficulties while going for a ",
"*fishing\nexpedition* with univariate methods. Performing cross-validation on entire\nROI or even full-brain allowed people to state presence of so desired effects\nwithout defending chosen critical value against multiple-comparisons.\nUnfortunately, as there is no such thing as ",
"*free lunch*, ability to work with\nall observable data at once came at a price for multivariate methods.\n\n",
"The second peculiarity of the application of statistical learning in\npsychological research is the actual neural data which researchers are doomed\nto analyze. As we have already seen from previous tutorial parts, typical\nfMRI data has",
"\n\n* ",
"relatively ",
"**low number of samples** (up to few thousands in total)",
"\n\n* ",
"relatively ",
"**large dimensionality** (tens of thousands)",
"\n\n* ",
"**small signal-to-noise ratio**",
"\n\n* ",
"**non-independent measurements**",
"\n\n* ",
"**unknown ground-truth** (either there is an effect at all, or if there is --\nwhat is inherent bias/error)",
"\n\n* ",
"**unknown nature of the signal**, since BOLD effect is not entirely\nunderstood.",
"\n\n",
"In the following part of the tutorial we will investigate the effects of some\nof those factors on classification performance with simple (or not so)\nexamples. But first lets overview the tools and methodologies for NHST\ncommonly employed."
]
},
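{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"The difference between P(D|H0) and P(H0|D) can be made concrete with a small back-of-the-envelope Bayes-rule calculation. The prior and the likelihood of the data under the alternative hypothesis below are made-up illustration values. With these numbers, a 'significant' p-value of 0.05 corresponds to a posterior probability of H0 of about 0.09 -- nearly twice as large:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# hypothetical values: prior P(H0)=0.5, observed p-value P(D|H0)=0.05,\n",
"# and P(D|H1)=0.5 -- probability of such data under the alternative\n",
"p_h0, p_d_h0, p_d_h1 = 0.5, 0.05, 0.5\n",
"# Bayes rule: P(H0|D) = P(D|H0)P(H0) / (P(D|H0)P(H0) + P(D|H1)(1-P(H0)))\n",
"p_h0_d = p_d_h0 * p_h0 / (p_d_h0 * p_h0 + p_d_h1 * (1 - p_h0))\n",
"p_h0_d"
],
"language": "python",
"metadata": {},
"outputs": []
},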
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Statistical Tools in Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"`scipy` Python module is an umbrella project to cover the majority of core\nfunctionality for scientific computing in Python. In turn, ",
"[stats](http://pymvpa.org/generated/scipy.stats.html#scipy-stats)\nsubmodule covers a wide range of continuous and discrete distributions and\nstatistical functions."
]
},
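{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"All distributions in `scipy.stats` share a common interface, which we will rely on below: `pdf`/`pmf`, `cdf`, `sf` (the survival function, 1-cdf), and their inverses `ppf` and `isf`. A minimal sketch with the standard normal distribution (the values shown are standard properties of that distribution):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from scipy.stats import norm\n",
"# cdf and its inverse ppf (percent point function) are consistent\n",
"norm.cdf(1.96)   # ~0.975\n",
"norm.ppf(0.975)  # ~1.96\n",
"# the survival function is the complement of the cdf\n",
"norm.sf(1.96)    # ~0.025"
],
"language": "python",
"metadata": {},
"outputs": []
},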
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"Glance over the ",
"`scipy.stats` documentation for what statistical functions\nand distributions families it provides. If you feel challenged, try to\nfigure out what is the meaning/application of ",
"[rdist()](http://pymvpa.org/generated/scipy.stats.rdist.html#scipy-stats-rdist)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n",
"\n\n",
"The most popular distribution employed for NHST in the context of statistical\nlearning, is ",
"[binom](http://pymvpa.org/generated/scipy.stats.binom.html#scipy-stats-binom) for testing either generalization\nperformance of the classifier on independent data could provide evidence that\nthe data contains the effects of interest."
]
},
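{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"For instance (with made-up counts), suppose a cross-validated binary classifier was correct on 30 of 40 independent test trials. Under H0, each prediction is a coin flip, so the number of correct predictions follows a binomial distribution with n=40 and p=0.5, and the one-sided p-value is the probability of observing 30 or more correct responses by chance:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from scipy.stats import binom\n",
"# hypothetical counts: 30 correct out of 40 independent binary test trials\n",
"n_trials, n_correct = 40, 30\n",
"# probability of >= 30 correct under chance (survival function at 29)\n",
"p_value = binom(n_trials, 0.5).sf(n_correct - 1)\n",
"p_value  # well below 0.05"
],
"language": "python",
"metadata": {},
"outputs": []
},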
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n*Note*",
"\n\n",
"`scipy.stats` provides function ",
"[binom_test()](http://pymvpa.org/generated/scipy.stats.binom_test.html#scipy-stats-binom-test), but that\none was devised only for doing two-sides tests, thus is not directly\napplicable for testing generalization performance where we aim at the tail\nwith lower than chance performance values.",
"- - -\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"Think about scenarios when could you achieve strong and very significant\nmis-classification performance, i.e. when, for instance, binary classifier\ntends to generalize into the other category. What could it mean?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n",
"\n\n",
"[binom](http://pymvpa.org/generated/scipy.stats.binom.html#scipy-stats-binom) whenever instantiated with the parameters of the\ndistribution (which are number of trials, probability of success on each\ntrial), it provides you ability to easily compute a variety of statistics of\nthat distribution. For instance, if we want to know, what would be the probability of having achieved\n57 of more correct responses out of 100 trials, we need to use a survival\nfunction (1-cdf) to obtain the ",
"*weight* of the right tail including 57\n(i.e. query for survival function of 56):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from scipy.stats import binom\n",
"binom100 = binom(100, 1./2)\n",
"print '%.3g' % binom100.sf(56)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Apparently obtaining 57 correct out 100 cannot be considered significantly\ngood performance by anyone. Lets investigate how many correct responses we\nneed to reach the level of 'significance' and use ",
"*inverse survival function*:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"binom100.isf(0.05) + 1"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"binom100.isf(0.01) + 1"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"So, depending on your believe and prior support for your hypothesis and data\nyou should get at least 59-63 correct responses from a 100 trials to claim\nthe existence of the effects. Someone could rephrase above observation that to\nachieve significant performance you needed an effect size of 9-13\ncorrespondingly for those two levels of significance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"Plot a curve of ",
"*effect sizes* (number of correct predictions above\nchance-level) vs a number of trials at significance level of 0.05 for a range\nof trial numbers from 4 to 1000. Plot %-accuracy vs number of trials for\nthe same range in a separate plot. TODO"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Dataset Exploration for Confounds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"*\"Randomization is a crucial aspect of experimental design... In the*.\n\n",
"\n\n> Unfortunately it is impossible to detect and warn about all possible sources\nof confounds which would invalidate NHST based on a simple parametric binomial\ntest. As a first step, it is always useful to inspect your data for possible\nsources of samples non-independence, especially if your results are not\nstrikingly convincing or too provocative. Possible obvious problems could be:",
"dis-balanced testing sets (usually non-equal number of samples for each\nlabel in any given chunk of data)order effects: either preference of having samples of particular target\nin a specific location or the actual order of targets",
"\n\n",
"To allow for easy inspection of dataset to prevent such obvious confounds,\n",
"[summary()](http://pymvpa.org/generated/mvpa2.datasets.miscfx.summary.html#mvpa2-datasets-miscfx-summary) function (also a method of any\n",
"`Dataset`) was constructed. Lets have yet another look at our 8-categories\ndataset:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from mvpa2.tutorial_suite import *\n",
"ds = get_haxby2001_data(roi='vt')\n",
"print ds.summary()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"You can see that labels were balanced across chunks -- i.e. that each chunk\nhas an equal number of samples of each target label, and that samples of\ndifferent labels are evenly distributed across chunks. TODO...\n\n",
"Counter-balance table shows either there were any order effects among\nconditions. In this case we had only two instances of each label in the\ndataset due to the averaging of samples across blocks, so it would be more\ninformative to look at the original sequence. To do so avoiding loading a\ncomplete dataset we would simply provide the stimuli sequence to\n",
"[SequenceStats](http://pymvpa.org/generated/mvpa2.clfs.miscfx.SequenceStats.html#mvpa2-clfs-miscfx-sequencestats) for the analysis:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"attributes_filename = os.path.join(pymvpa_dataroot, 'attributes_literal.txt')\n",
"attr = SampleAttributes(attributes_filename)\n",
"targets = np.array(attr.targets)\n",
"ss = SequenceStats(attr.targets)\n",
"print ss"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Order statistics look funky at first, but they would not surprise you if you\nrecall the original design of the experiment -- blocks of 8 TRs per each\ncategory, interleaved with 6 TRs of rest condition. Since samples from two\nadjacent blocks are far apart enough not to contribute to 2-back table (O2\ntable on the right), it is worth inspecting if there was any dis-balance in\nthe order of the picture conditions blocks. It would be easy to check if we\nsimply drop the 'rest' condition from consideration:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print SequenceStats(targets[targets != 'rest'])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"Generate few 'designs' consisting of varying condition sequences and assess\ntheir counter-balance. Generate some random designs using random number\ngenerators or permutation functions provided in ",
"[None](http://pymvpa.org/generated/numpy.random.html#numpy-random) and\nassess their counter-balance."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n",
"\n\n",
"\n\n> Some sources of confounds might be hard to detect or to eliminate:",
"dependent variable is assessed after data has been collected (RT, ACC,\netc) so it might be hard to guarantee equal sampling across different\nsplits of the data.motion effects, if motion is correlated with the design, might introduce\nmajor confounds into the signal. With multivariate analysis the problem\nbecomes even more sever due to the high sensitivity of multivariate methods\nand the fact that motion effects might be impossible to eliminate entirely\nsince they are strongly non-linear. So, even if you regress out whatever\nnumber of descriptors describing motion (mean displacement, angles, shifts,\netc.) you would not be able to eliminate motion effects entirely. And that\nresidual variance from motion spread through the entire volume might\ncontribute to your ",
"*generalization performance*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"Inspect the arguments of generic interface of all splitters\n",
"[Splitter](http://pymvpa.org/generated/mvpa2.datasets.splitters.Splitter.html#mvpa2-datasets-splitters-splitter) for a possible workaround in the\ncase of dis-balanced targets."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n",
"\n\n",
"Therefore, before the analysis on the actual fMRI data, it might be worth\ninspecting what kind of ",
"[generalization](http://pymvpa.org/glossary.html#term-generalization) performance you might obtain if\nyou operate simply on the confounds (e.g. motion parameters and effects)."
]
},
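{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"As a pure-NumPy sketch of why such an inspection matters (the numbers here are hypothetical): with an imbalanced testing set, a degenerate classifier that always predicts the majority class appears to perform well above the nominal 50% chance level without using the data at all:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"# hypothetical imbalanced test set: 70 samples of class 0, 30 of class 1\n",
"targets = np.array([0] * 70 + [1] * 30)\n",
"# degenerate 'classifier' always predicting the majority class\n",
"predictions = np.zeros(len(targets), dtype=int)\n",
"np.mean(predictions == targets)  # 0.7 -- 'above chance' on no information"
],
"language": "python",
"metadata": {},
"outputs": []
},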
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Hypothesis Testing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n*Note*",
"\n\n",
"When thinking about what critical value to choose for NHST keep such\n",
"*guidelines from NHST inventor, Dr.Fisher* in mind. For\nsignificance range '0.2 - 0.5' he says: \"judged significant, though barely\nso; ... these data do not, however, demonstrate the point beyond possibility\nof doubt\".",
"- - -\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Ways to assess ",
"*by-chance* null-hypothesis distribution of measures range from\nfixed, to estimated parametric, to non-parametric permutation testing.\nUnfortunately not a single way provides an ultimate testing facility to be\napplied blindly to any chosen problem without investigating the\nappropriateness of the data at hand (see previous section). Every kind of\n",
"[Measure](http://pymvpa.org/generated/mvpa2.measures.base.Measure.html#mvpa2-measures-base-measure) provides an easy way to trigger\nassessment of ",
"*statistical significance* by specifying `null_dist` parameter\nwith a distribution estimator. After a given measure is computed, the\ncorresponding p-value(s) for the returned value(s) could be accessed at\n`ca.null_prob`.\n\n",
"*\"Applications of permutation testing methods to single subject fMRI*"
]
},
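{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"The idea behind such permutation-based estimators can be sketched without PyMVPA in plain NumPy (a toy illustration, not PyMVPA's actual API): repeatedly permute the targets, recompute the measure of interest, and take the fraction of permutations performing at least as well as the unpermuted data as the p-value:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"rng = np.random.RandomState(0)\n",
"def accuracy(data, targets):\n    # toy measure: nearest-class-mean classification accuracy (in-sample)\n    m0, m1 = data[targets == 0].mean(), data[targets == 1].mean()\n    predictions = (np.abs(data - m1) < np.abs(data - m0)).astype(int)\n    return np.mean(predictions == targets)\n",
"# toy 1-D dataset with a genuine class difference of one standard deviation\n",
"targets = np.array([0] * 50 + [1] * 50)\n",
"data = rng.normal(loc=targets, scale=1.0)\n",
"observed = accuracy(data, targets)\n",
"# Monte-Carlo estimate of the null distribution by permuting the targets\n",
"null = [accuracy(data, rng.permutation(targets)) for _ in range(1000)]\n",
"p_value = np.mean(np.array(null) >= observed)"
],
"language": "python",
"metadata": {},
"outputs": []
},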
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n**Exercise**",
"\n\n",
"Try to assess significance of the finding on two problematic categories\nfrom 8-categories dataset without averaging the samples within the blocks\nof the same target. Even non-parametric test should be overly optimistic\n(forgotten ",
"**exchangeability** requirement for parametric testing, such as\nmultiple samples within a block for a block design)... TODO"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# you can use this cell for this exercise"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Independent Samples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Since \"voodoo correlations\" paper, most of the literature in brain imaging is\nseems to became more careful in avoiding \"double-dipping\" and keeping their\ntesting data independent from training data, which is one of the major\nconcerns for doing valid hypothesis testing later on. Not much attention is\ngiven though to independence of samples aspect -- i.e. not only samples in\ntesting set should be independent from training ones, but, to make binomial\ndistribution testing valid, testing samples should be independent from each\nother as well. The reason is simple -- number of the testing samples defines\nthe width of the null-chance distribution, but consider the limiting case\nwhere all testing samples are heavily non-independent, consider them to be a\n1000 instances of the same sample. Canonical binomial distribution would be\nvery narrow, although effectively it is just 1 independent sample being\ntested, thus ... TODO"
]
},
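{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"A quick numeric illustration (with made-up numbers): suppose 530 of 1000 test 'samples' are classified correctly, but those 1000 samples are really only 10 independent samples, each repeated 100 times. The naive binomial null distribution is then far too narrow:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from scipy.stats import binom\n",
"# naive null: 1000 'independent' trials -- 53% correct looks significant\n",
"p_naive = binom(1000, 0.5).sf(529)\n",
"# honest null: only 10 independent samples -- 6/10 correct is unremarkable\n",
"p_honest = binom(10, 0.5).sf(5)\n",
"p_naive, p_honest"
],
"language": "python",
"metadata": {},
"outputs": []
},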
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Statistical Treatment of Sensitivities"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n*Note*",
"\n\n",
"Statistical learning is about constructing reliable models to\ndescribe the data, and not really to reason either data is noise.",
"- - -\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- - -\n*Note*",
"\n\n",
"How do we decide to threshold sensitivities, remind them searchlight\nresults with strong bimodal distributions, distribution outside of\nthe brain as a true by-chance. May be reiterate that sensitivities\nof bogus model are bogus",
"- - -\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"Moreover, constructed mapping with barely ",
"*above-chance* performance is often\nfurther analyzed for its ",
"*sensitivity to the input variables*."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n",
"\n*Cohen, J. (1994)*: *Classical critic of null hypothesis significance testing*",
"\n\n",
"\n*Fisher, R. A. (1925)*: *One of the 20th century's most influential books on statistical methods, which\ncoined the term 'Test of significance'.*",
"\n\n",
"\n*Ioannidis, J. (2005)*: *Simulation study speculating that it is more likely for a research claim to\nbe false than true. Along the way the paper highlights aspects to keep in\nmind while assessing the 'scientific significance' of any given study, such\nas, viability, reproducibility, and results.*",
"\n\n",
"\n*Nichols et al. (2002)*: *Overview of standard nonparametric randomization and permutation testing\napplied to neuroimaging data (e.g. fMRI)*",
"\n\n",
"\n*Wright, D. (2009)*: *Historical excurse into the life of 10 prominent statisticians of XXth century\nand their scientific contributions.*"
]
}
],
"metadata": {}
}
]
}