Frequently Asked Questions

General

I’m a Matlab user. How hard is learning Python and PyMVPA for me?

If you are coming from Matlab, you will soon notice a lot of similarities between Matlab and Python (besides the huge advantages of Python over Matlab). For an easy transition you might want to have a look at a basic comparison of Matlab and NumPy.

It would be nice to have some guidelines on how to use PyMVPA for users who are already familiar with the Matlab MVPA toolbox. If you are using both packages and could compile a few tips, your contribution would be most welcome.

A recent paper by Jurica and van Leeuwen (2009) describes an open-source MATLAB®-to-Python compiler which might be a very useful tool to migrate a substantial amount of Matlab-based source code to Python and therefore also aids the migration of developers from Matlab to the new “general open-source lingua franca for scientific computation”.

It is sloooooow. What can I do?

Have you tried running the Python interpreter with -O? PyMVPA provides lots of debug messages with information that is computed in addition to the work that really has to be done. However, if Python is running in optimized mode, PyMVPA will not waste time on this and really tries to be fast.
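PyMVPA's debug code is guarded by Python's built-in __debug__ flag, which the -O switch turns off. A quick way to check which mode you are in (plain Python, nothing PyMVPA-specific):

>>> # True in a normal session; False under "python -O", where PyMVPA
>>> # skips all of its guarded debug bookkeeping
>>> print __debug__
True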

If you are already running it optimized, then maybe you are doing something really demanding...

I am tired of writing these endless import blocks. Any alternative?

Sure. Instead of individually importing all pieces that are required by a script, you can import them all at once. A simple:

>>> import mvpa2.suite as mvpa2

makes everything directly accessible through the mvpa2 namespace, e.g. mvpa2.datasets.base.Dataset becomes mvpa2.Dataset. Really lazy people can even do:

>>> from mvpa2.suite import *

However, as always there is a price to pay for this convenience. In contrast to the individual imports, there is some initial performance and memory cost. In the worst case you’ll get all external dependencies loaded (e.g. a full R session), just because you have them installed. Therefore, it might be better to limit this use to cases where individual key presses matter, and to use individual imports for production scripts.
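For comparison, a production script would typically import just the pieces it needs, e.g. only the Dataset class mentioned above:

>>> # an individual import keeps startup time and memory footprint low
>>> from mvpa2.datasets.base import Dataset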

I feel like I want to contribute something, do you mind?

Not at all! If you think there is something that is not well explained in the documentation, send us an improvement. If you implemented a new algorithm using PyMVPA that you want to share, please share. If you have an idea for some other improvement (e.g. speed, functionality), but you have no time/cannot/do not want to implement it yourself, please post your idea to the PyMVPA mailing list.

I want to develop a new feature for PyMVPA. How can I do it efficiently?

The best way is to use Git for both getting the latest code from the repository and preparing the patch. Here is a quick sketch of the workflow.

First get the latest code:

git clone git://github.com/PyMVPA/PyMVPA.git

This will create a new PyMVPA subdirectory that contains the complete repository. Enter this directory and run gitk --all to browse the full history and all branches that have ever been published.

You can run:

git fetch origin

in this directory at any time to get the latest changes from the main repository.

Next, you have to decide what you want to base your new feature on. In the simplest case this is the master branch (the one that contains the code that will become the next release). Creating a local branch based on the (remote) master branch is done with:

git checkout -b my_hack origin/master

Now you are ready to start hacking. You are free to use all the powers of Git (and yours, of course). You can do multiple commits, fetch new stuff from the repository, and merge it into your local branch, ... To get a feeling for what can be done, take a look at a very short description of Git or a more comprehensive Git tutorial.

When you are done with the new feature, you can prepare the patch for inclusion into PyMVPA. If you have done multiple commits you might want to squash them into a single patch containing the new feature. You can do this with git rebase. Any recent version of git rebase has an option --interactive, which allows you to easily pick, squash or even further edit any of the previous commits you have made. Rebase your local branch against the remote branch you started hacking on (origin/master in this example):

git rebase --interactive origin/master

When you are done, you can generate the final patch file:

git format-patch origin/master

The above command will generate a patch file for each commit in your local branch that is not yet part of origin/master. The patch files can then be easily emailed.

The manual is quite insufficient. When will you improve it?

Writing a manual can be a tricky task if you already know the details and have to imagine what might be the most interesting information for someone who is just starting. If you feel that something is missing which has cost you some time to figure out, please drop us a note and we will add it as soon as possible. If you have developed some code snippets to demonstrate some feature or non-trivial behavior (maybe even trivial ones, which are not as obvious as they should be), please consider sharing this snippet with us and we will put it into the example collection or the manual. Thanks!

Data import, export and storage

What file formats are understood by PyMVPA?

Please see the data_formats section.

What if there is no special file format for some particular datatype?

With the h5save() function, PyMVPA supports storing any kind of serializable data into a (compressed) HDF5 file. The facility is particularly useful for storing any number of intermediate analysis results, e.g. for post-processing.
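A minimal sketch (the filename and the stored dictionary are arbitrary; h5load() is the matching loader):

>>> # store any serializable Python object in an HDF5 file ...
>>> from mvpa2.base.hdf5 import h5save, h5load
>>> h5save('results.hdf5', {'accuracy': 0.75})
>>> # ... and read it back later, e.g. in a post-processing script
>>> h5load('results.hdf5')['accuracy']
0.75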

Data preprocessing

Is there an easy way to remove invariant features from a dataset?

You might have to deal with invariant features in cases like an fMRI dataset, where the brain mask is slightly larger than the thresholded fMRI timeseries image. Such invariant features (i.e. features with zero variance) are sometimes a problem, e.g. they will lead to numerical difficulties when z-scoring the features of a dataset (i.e. division by zero).

The mvpa2.datasets.miscfx module provides a convenience function remove_invariant_features() that strips such features from a dataset.
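For example (assuming dataset is any PyMVPA dataset that may contain zero-variance features):

>>> from mvpa2.datasets.miscfx import remove_invariant_features
>>> clean_dataset = remove_invariant_features(dataset)
>>> # the cleaned dataset can only have lost features, never gained any
>>> clean_dataset.nfeatures <= dataset.nfeatures
True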

How can I do block-averaging of my block-design fMRI dataset?

The easiest way is to use a mapper to transform/average the respective samples. Suppose you have a dataset:

>>> dataset = normal_feature_dataset()
>>> print dataset
<Dataset: 100x4@float64, <sa: chunks,targets>>

Averaging all samples with the same label in each chunk individually is done by applying a mapper to the dataset.

>>> from mvpa2.mappers.fx import mean_group_sample
>>>
>>> m = mean_group_sample(['targets', 'chunks'])
>>> mapped_dataset = dataset.get_mapped(m)
>>> print mapped_dataset
<Dataset: 10x4@float64, <sa: chunks,targets>, <a: mapper>>

mean_group_sample creates an FxMapper that computes the mean of every group of samples sharing the same targets and chunks values, and therefore yields one averaged sample per label in each chunk.
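Continuing the example, a quick sanity check confirms that exactly one averaged sample per (target, chunk) combination remains:

>>> # number of mapped samples == number of targets times number of chunks
>>> len(mapped_dataset) == len(dataset.sa['targets'].unique) * len(dataset.sa['chunks'].unique)
True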

Data analysis

How do I know which features were finally selected by a classifier doing feature selection?

All feature selection classifiers use a built-in mapper to slice datasets. This mapper can be queried for the selected features, or simply used to apply the same feature selection to other datasets.

>>> clf = FeatureSelectionClassifier(
...           kNN(k=5),
...           SensitivityBasedFeatureSelection(
...               SMLRWeights(SMLR(lm=1.0), postproc=maxofabs_sample()),
...               FixedNElementTailSelector(1, tail='upper', mode='select')))
>>> clf.train(dataset)
>>> len(clf.mapper.slicearg)
1
>>> final_dataset = clf.mapper.forward(dataset)
>>> print final_dataset
<Dataset: 100x1@float64, <sa: chunks,targets>>

In the above code snippet a kNN classifier is defined that performs a feature selection step prior to training. Features are selected according to the maximum absolute magnitude of the weights of an SMLR classifier trained on the data (the same training data that will also go into kNN). Absolute SMLR weights are used for feature selection, as large negative values also indicate important information. Finally, the classifier is configured to select the single most important feature (given the SMLR weights). Once trained, the classifier’s mapper provides the desired information, and can e.g. be applied to generate a stripped dataset for an analysis of the similarity structure.

How do I extract sensitivities from a classifier used within a cross-validation?

In various parts of PyMVPA it is possible to extract information from inside loops via callbacks. To extract sensitivities from inside a cross-validation analysis, without unnecessary retraining of the classifier, one only needs to write a corresponding callback function. Here is a sketch:

>>> sensitivities = []
>>> def store_me(data, node, result):
...     sens = node.measure.get_sensitivity_analyzer(force_train=False)(data)
...     sensitivities.append(sens)
>>>
>>> cv = CrossValidation(SMLR(), OddEvenPartitioner(), callback=store_me)
>>> merror = cv(dataset)
>>> len(sensitivities)
2
>>> sensitivities[0].shape == (len(dataset.uniquetargets), dataset.nfeatures)
True

First we set up a container (a list) to store the sensitivities for all cross-validation folds. Next is the callback: it takes three arguments, as described in the documentation of RepeatedMeasure. The second argument is the node that is evaluated inside the loop. For a cross-validation this is a TransferMeasure that exposes its internal classifier via the measure property. The rest is straightforward: we construct a sensitivity analyzer and pass it the input data. Finally, we store the returned sensitivities.

Can PyMVPA deal with literal class labels?

Yes. For all external machine learning libraries that do not support literal labels, PyMVPA will transparently convert them to numerical ones, and also revert this transformation for all output values.
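For example, a dataset with string targets can be fed directly to a LibSVM-based classifier (a sketch, assuming LinearCSVMC is available and dataset carries literal targets as in the examples above):

>>> clf = LinearCSVMC()
>>> clf.train(dataset)
>>> predictions = clf.predict(dataset.samples)
>>> # predictions come back as the original literal labels, not as numbers
>>> set(predictions).issubset(set(dataset.sa.targets))
True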