Classifiers – All Alike, Yet Different


This tutorial part is also available for download as an IPython notebook: [ipynb]

In this chapter we will continue our work from Getting data in shape in order to replicate the study of Haxby et al. (2001). For this tutorial there is a little helper function to yield the dataset we generated manually before:

>>> from mvpa2.tutorial_suite import *
>>> ds = get_haxby2001_data()

The original study employed a so-called 1-nearest-neighbor classifier, using correlation as a distance measure. In PyMVPA this type of classifier is provided by the kNN class, that makes it possible to specify the desired parameters.

>>> clf = kNN(k=1, dfx=one_minus_correlation, voting='majority')

A k-Nearest-Neighbor classifier performs classification based on the similarity of a sample with respect to each sample in a training dataset. The value of k specifies the number of neighbors to derive a prediction, dfx sets the distance measure that determines the neighbors, and voting selects a strategy to choose a single label from the set of targets assigned to these neighbors.


Access the built-in help to inspect the kNN class regarding additional configuration options.

Now that we have a classifier instance, it can be easily trained by passing the dataset to its train() method.

>>> clf.train(ds)

A trained classifier can subsequently be used to perform classification of unlabeled samples. The classification performance can be assessed by comparing these predictions to the target labels.

>>> predictions = clf.predict(ds.samples)
>>> np.mean(predictions ==

We see that the classifier performs remarkably well on our dataset – it doesn’t make even a single prediction error. However, most of the time we would not be particularly interested in the prediction accuracy of a classifier on the same dataset that it got trained with.


Think about why this particular classifier will always perform error-free classification of the training data – regardless of the actual dataset content. If the reason is not immediately obvious, take a look at chapter 13.3 in The Elements of Statistical Learning. Investigate how the accuracy varies with different values of k. Why is that?

Instead, we are interested in the generalizability of the classifier on new, unseen data. This would allow us, in principle, to use it to assign labels to unlabeled data. Because we only have a single dataset, it needs to be split into (at least) two parts to achieve this. In the original study, Haxby and colleagues split the dataset into patterns of activations from odd versus even-numbered runs. Our dataset has this information in the runtype sample attribute:

>>> print
['even' 'even' 'even' 'even' 'even' 'even' 'even' 'even' 'odd' 'odd' 'odd'
 'odd' 'odd' 'odd' 'odd' 'odd']

Using this attribute we can now easily split the dataset in half. PyMVPA datasets can be sliced in similar ways as NumPy‘s ndarray. The following calls select the subset of samples (i.e. rows in the datasets) where the value of the runtype attribute is either the string ‘even’ or ‘odd’.

>>> ds_split1 = ds[ == 'odd']
>>> len(ds_split1)
>>> ds_split2 = ds[ == 'even']
>>> len(ds_split2)

Now we could repeat the steps above: call train() with one dataset half and predict() with the other, and compute the prediction accuracy manually. However, a more convenient way is to let the classifier do this for us. Many objects in PyMVPA support a post-processing step that we can use to compute something from the actual results. The example below computes the mismatch error between the classifier predictions and the target values stored in our dataset. To make this work, we do not call the classifier’s predict() method anymore, but “call” the classifier directly with the test dataset. This is a very common usage pattern in PyMVPA that we shall see a lot over the course of this tutorial. Again, please note that we compute an error now, hence lower values represent more accurate classification.

>>> clf.set_postproc(BinaryFxNode(mean_mismatch_error, 'targets'))
>>> clf.train(ds_split2)
>>> err = clf(ds_split1)
>>> print np.asscalar(err)

In this case, our choice of which half of the dataset is used for training and which half for testing was completely arbitrary, hence we could also estimate the transfer error after swapping the roles:

>>> clf.train(ds_split1)
>>> err = clf(ds_split2)
>>> print np.asscalar(err)

We see that on average the classifier error is really low, and we achieve an accuracy level comparable to the results reported in the original study.


What we have just done was to manually split the dataset into combinations of training and testing datasets, given a specific sample attribute – in this case whether a pattern of activation or sample came from even or odd runs. We ran the classification analysis on each split to estimate the performance of the classifier model. In general, this approach is called cross-validation, and involves splitting the dataset into multiple pairs of subsets, choosing sample groups by some criterion, and estimating the classifier performance by training it on the first dataset in a split and testing against the second dataset from the same split.

PyMVPA provides a way to allow complete cross-validation procedures to run fully automatic, without the need for manual splitting of a dataset. Using the CrossValidation class, a cross-validation is set up by specifying what measure should be computed on each dataset split and how dataset splits should be generated. The measure that is usually computed is the transfer error that we already looked at in the previous section. The second element, a generator for datasets, is another very common tool in PyMVPA. The following example uses HalfPartitioner, a generator that, when called with a dataset, marks all samples regarding their association with the first or second half of the dataset. This happens based on the values of a specified sample attribute – in this case runtype – much like the manual dataset splitting that we have performed earlier. HalfPartitioner will make sure to subsequently assign samples to both halves, i.e. samples from the first half in the first generated dataset will be in the second half of the second generated dataset. With these two techniques we can replicate our manual cross-validation easily – reusing our existing classifier, but without the custom post-processing step.

>>> # disable post-processing again
>>> clf.set_postproc(None)
>>> # dataset generator
>>> hpart = HalfPartitioner(attr='runtype')
>>> # complete cross-validation facility
>>> cv = CrossValidation(clf, hpart)


Try calling the hpart object with our dataset. What happens? Now try passing the dataset to its generate() method. What happens now? Make yourself familiar with the concept of a Python generator. Investigate what the code snippet list(xrange(5)) does, and try to adapt it to the HalfPartitioner.

Once the cv object is created, it can be called with a dataset, just like we did with the classifier before. It will internally perform all the dataset partitioning, split each generated dataset into training and testing sets (based on the partitions), and train and test the classifier repeatedly. Finally, it will return the results of all cross-validation folds.

>>> cv_results = cv(ds)
>>> np.mean(cv_results)

Actually, the cross-validation results are returned as another dataset that has one sample per fold and a single feature with the computed transfer-error per fold.

>>> len(cv_results)
>>> cv_results.samples
array([[ 0.   ],
       [ 0.125]])

Any classifier, really

A short summary of all code for the analysis we developed so far is this:

>>> clf = kNN(k=1, dfx=one_minus_correlation, voting='majority')
>>> cvte = CrossValidation(clf, HalfPartitioner(attr='runtype'))
>>> cv_results = cvte(ds)
>>> np.mean(cv_results)

Looking at this little code snippet we can nicely see the logical parts of a cross-validated classification analysis.

  1. Load the data
  2. Choose a classifier
  3. Set up an error function
  4. Evaluate the error in a cross-validation procedure
  5. Inspect results

Our previous choice of the classifier was guided by the intention to replicate Haxby et al. (2001), but what if we want to try a different algorithm? In this case another nice feature of PyMVPA comes into play. All classifiers implement a common interface that makes them easily interchangeable without the need to adapt any other part of the analysis code. If, for example, we want to try the popular svm (Support Vector Machines) on our example dataset it looks like this:

>>> clf = LinearCSVMC()
>>> cvte = CrossValidation(clf, HalfPartitioner(attr='runtype'))
>>> cv_results = cvte(ds)
>>> np.mean(cv_results)

Instead of k-nearest-neighbor, we create a linear SVM classifier, internally using the popular LIBSVM library (note that PyMVPA provides additional SVM implementations). The rest of the code remains identical. SVM with its default settings seems to perform slightly worse than the simple kNN-classifier. We’ll get back to the classifiers shortly. Let’s first look at the remaining part of this analysis.

We already know that CrossValidation can be used to compute errors. So far we have only used the mean number of mismatches between actual targets and classifier predictions as the error function (which is the default). However, PyMVPA offers a number of alternative functions in the errorfx module, but it is also trivial to specify custom ones. For example, if we do not want to have error reported, but instead accuracy, we can do that:

>>> cvte = CrossValidation(clf, HalfPartitioner(attr='runtype'),
...                        errorfx=lambda p, t: np.mean(p == t))
>>> cv_results = cvte(ds)
>>> np.mean(cv_results)

This example reuses the SVM classifier we have create before, and yields exactly what we expect from the previous result.

The details of the cross-validation procedure are also heavily customizable. We have seen that a Partitioner is used to generate training and testing dataset for each cross-validation fold. So far we have only used HalfPartitioner to divide the dataset into odd and even runs (based on our custom sample attribute runtype). However, in general it is more common to perform so called leave-one-out cross-validation, where one independent part of a dataset is selected as testing dataset, while the other parts constitute the training dataset. This procedure is repeated till all parts have served as the testing dataset once. In case of our dataset we could consider each of the 12 runs as independent measurements (fMRI data doesn’t allow us to consider temporally adjacent data to be considered independent).

To run such an analysis, we first need to redo our dataset preprocessing, as, in the current one, we only have one sample per stimulus category for both odd and even runs. To get a dataset with one sample per stimulus category for each run, we need to modify the averaging step. Using what we have learned from the last tutorial part the following code snippet should be plausible:

>>> # directory that contains the data files
>>> datapath = os.path.join(tutorial_data_path, 'haxby2001')
>>> # load the raw data
>>> ds = load_tutorial_data(roi='vt')
>>> # pre-process
>>> poly_detrend(ds, polyord=1, chunks_attr='chunks')
>>> zscore(ds, param_est=('targets', ['rest']))
>>> ds = ds[ != 'rest']
>>> # average
>>> run_averager = mean_group_sample(['targets', 'chunks'])
>>> ds = ds.get_mapped(run_averager)
>>> ds.shape
(96, 577)

Instead of two samples per category in the whole dataset, now we have one sample per category, per experiment run, hence 96 samples in the whole dataset. To set up a 12-fold leave-one-run-out cross-validation, we can make use of NFoldPartitioner. By default it is going to select samples from one chunk at a time:

>>> cvte = CrossValidation(clf, NFoldPartitioner(),
...                        errorfx=lambda p, t: np.mean(p == t))
>>> cv_results = cvte(ds)
>>> np.mean(cv_results)

We get almost the same prediction accuracy (reusing the SVM classifier and our custom error function). Note that this time we performed the analysis on a lot more samples that were each was computed from just a few fMRI volumes (about nine each).

So far we have just looked at the mean accuracy or error. Let’s investigate the results of the cross-validation analysis a bit further.

>>> type(cv_results)
<class 'mvpa2.datasets.base.Dataset'>
>>> print cv_results.samples
[[ 0.75 ]
 [ 0.875]
 [ 1.   ]
 [ 0.75 ]
 [ 0.75 ]
 [ 0.875]
 [ 0.75 ]
 [ 0.875]
 [ 0.75 ]
 [ 0.375]
 [ 1.   ]
 [ 0.625]]

The returned value is actually a Dataset with the results for all cross-validation folds. Since our error function computes only a single scalar value for each fold the dataset only contains a single feature (in this case the accuracy), and a sample per each fold.

We Need To Take A Closer Look

By now we have already done a few cross-validation analyses using two different classifiers and different pre-processing strategies. In all these cases we have just looked at the generalization performance or error. However, error rates hide a lot of interesting information that is very important for an interpretation of results. In our case we analyzed a dataset with eight different categories. An average misclassification rate doesn’t tell us much about the contribution of each category to the prediction error. It could be that half of the samples of each category get misclassified, but the same average error might be due to all samples from half of the categories being completely misclassified, while prediction accuracy for samples from the remaining categories is perfect. These two results would have to be interpreted in totally different ways, despite the same average error rate.

In psychological research this type of results is usually presented as a contingency table or cross tabulation of expected vs. empirical results. Signal detection theory offers a whole range of techniques to characterize such results. From this angle a classification analysis is hardly any different from a psychological experiment where a human observer performs a detection task, hence the same analysis procedures can be applied here as well.

PyMVPA provides convenient access to confusion matrices, i.e. contingency tables of targets vs. actual predictions. However, to prevent wasting CPU-time and memory they are not computed by default, but instead have to be enabled explicitly. Optional analysis results like this are available in a dedicated collection of conditional attributes – analogous to sa and fa in datasets, it is named ca. Let’s see how it works:

>>> cvte = CrossValidation(clf, NFoldPartitioner(),
...                        errorfx=lambda p, t: np.mean(p == t),
...                        enable_ca=['stats'])
>>> cv_results = cvte(ds)

Via the enable_ca argument we triggered computing confusion tables for all cross-validation folds, but otherwise there is no change in the code. Afterwards the aggregated confusion for the whole cross-validation procedure is available in the ca collection. Let’s take a look (note that in the printed manual the output is truncated due to page-width constraints – please refer to the HTML-based version full the full matrix).

>>> print
predictions\targets     bottle         cat          chair          face         house        scissors    scrambledpix      shoe
            `------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------ P'   N'   FP   FN   PPV  NPV  TPR  SPC  FDR  MCC  F1
       bottle             6             0             3             0             0             5             0             1       15   75    9    6   0.4 0.92  0.5 0.88  0.6 0.34 0.44
        cat               0             10            0             0             0             0             0             0       10   67    0    2    1  0.97 0.83   1    0  0.79 0.91
       chair              0             0             7             0             0             0             0             0        7   73    0    5    1  0.93 0.58   1    0  0.66 0.74
        face              0             2             0             12            0             0             0             0       14   63    2    0  0.86   1    1  0.97 0.14  0.8 0.92
       house              0             0             0             0             12            0             0             0       12   63    0    0    1    1    1    1    0  0.87   1
      scissors            2             0             1             0             0             6             0             0        9   75    3    6  0.67 0.92  0.5 0.96 0.33 0.48 0.57
    scrambledpix          2             0             1             0             0             0             12            1       16   63    4    0  0.75   1    1  0.94 0.25 0.75 0.86
        shoe              2             0             0             0             0             1             0             10      13   67    3    2  0.77 0.97 0.83 0.96 0.23 0.69  0.8
Per target:          ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
         P                12            12            12            12            12            12            12            12
         N                84            84            84            84            84            84            84            84
         TP               6             10            7             12            12            6             12            10
         TN               69            65            68            63            63            69            63            65
Summary \ Means:     ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------ 12 68.25 2.62 2.62 0.81 0.96 0.78 0.96 0.19 0.67 0.78
       CHI^2            442.67       p=2e-58
        ACC              0.78
        ACC%            78.12
     # of sets            12       ACC(i) = 0.87-0.015*i p=0.3 r=-0.33 r^2=0.11

Statistics computed in 1-vs-rest fashion per each target.
Abbreviations (for details see
 TP : true positive (AKA hit)
 TN : true negative (AKA correct rejection)
 FP : false positive (AKA false alarm, Type I error)
 FN : false negative (AKA miss, Type II error)
 TPR: true positive rate (AKA hit rate, recall, sensitivity)
      TPR = TP / P = TP / (TP + FN)
 FPR: false positive rate (AKA false alarm rate, fall-out)
      FPR = FP / N = FP / (FP + TN)
 ACC: accuracy
      ACC = (TP + TN) / (P + N)
 SPC: specificity
      SPC = TN / (FP + TN) = 1 - FPR
 PPV: positive predictive value (AKA precision)
      PPV = TP / (TP + FP)
 NPV: negative predictive value
      NPV = TN / (TN + FN)
 FDR: false discovery rate
      FDR = FP / (FP + TP)
 MCC: Matthews Correlation Coefficient
      MCC = (TP*TN - FP*FN)/sqrt(P N P' N')
 F1 : F1 score
      F1 = 2TP / (P + P') = 2TP / (2TP + FP + FN)
 AUC: Area under (AUC) curve
 CHI^2: Chi-square of confusion matrix
 LOE(ACC): Linear Order Effect in ACC across sets
 # of sets: number of target/prediction sets which were provided

This output is a comprehensive summary of the performed analysis. We can see that the confusion matrix has a strong diagonal, and confusion happens mostly among small objects. In addition to the plain contingency table there are also a number of useful summary statistics readily available – including average accuracy.

Especially for multi-class datasets the matrix quickly becomes incomprehensible. For these cases the confusion matrix can also be plotted via its plot() method. If the confusions shall be used as input for further processing they can also be accessed in pure matrix format:

>>> print
[[ 6  0  3  0  0  5  0  1]
 [ 0 10  0  0  0  0  0  0]
 [ 0  0  7  0  0  0  0  0]
 [ 0  2  0 12  0  0  0  0]
 [ 0  0  0  0 12  0  0  0]
 [ 2  0  1  0  0  6  0  0]
 [ 2  0  1  0  0  0 12  1]
 [ 2  0  0  0  0  1  0 10]]

The classifier confusions are just an example of the general mechanism of conditional attribute that is supported by many objects in PyMVPA.