DisTools examples: Classifiers in dissimilarity space

This page shows some introductory examples of the use of DisTools in dissimilarity space, based on (2D) feature spaces as well as on given dissimilarity matrices. It is assumed that readers are familiar with PRTools and will consult the related documentation pages where needed.

The dissimilarity matrix

The dissimilarity matrix is built from the dissimilarities between a set of objects (training set, test set) and a representation set. The representation set can be the training set itself, the test set, a selected set of prototypes, or any other set of objects, as long as dissimilarities with it can be or have been computed. The dissimilarity matrix is considered a representation of the $$m$$ objects by the $$r$$ objects of the representation set, and thereby has size $$m \times r$$. We will assume that it is available as a PRTools dataset of size [m,r]. Where needed (for training and evaluation) it should be labeled. Some procedures assume that the representation set is labeled as well.
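As an illustration, such an $$m \times r$$ matrix could be obtained from a feature representation as follows. This is a minimal sketch on random data; it assumes the PRTools routine distm, which returns squared Euclidean distances:

```matlab
% Minimal sketch (random data): build an m x r dissimilarity matrix
% between a set A of m objects and a representation set R of r
% prototypes, using Euclidean distances in a 2D feature space.
% Note: distm returns squared Euclidean distances in PRTools.
A = randn(6,2);        % m = 6 objects
R = randn(3,2);        % r = 3 prototypes
X = sqrt(distm(A,R));  % 6 x 3 matrix of Euclidean dissimilarities
```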

From a given dissimilarity matrix of doubles X between a set of objects $$A$$ and a representation set $$R$$, and the labels labels_A of $$A$$, a PRTools dataset D is constructed by:

D = prdataset(X,labels_A);

If the matrix is square, between the objects and themselves, the labels of the representation set can be added by:

D = setfeatlab(D,getlabels(D));

Alternatively, and in case the representation set is different from the represented objects, its labels labels_R can be added by:

D = setfeatlab(D,labels_R);

Sometimes not a dissimilarity matrix but a similarity matrix is given: the higher the value of a similarity, the more similar the related objects are. A dataset S based on similarities may be converted into a dissimilarity matrix by any monotonically decreasing function, e.g.

D = -S;
D = 1-S;
D = 1./S;

Which conversion is most appropriate depends on the scaling of the given similarities and the desired scaling of the resulting dissimilarities (see below).
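For instance (a small sketch with made-up values), for similarities scaled in (0,1] with self-similarity 1, the conversion 1-S yields dissimilarities starting at zero for identical objects, while -S and 1./S shift or stretch the scale differently:

```matlab
% Effect of the three conversions on example similarity values
s = [1.0 0.8 0.1];   % similarities of an object to three others
1 - s                % -> [0 0.2 0.9], zero self-dissimilarity
-s                   % -> [-1 -0.8 -0.1], negative values
1./s                 % -> [1 1.25 10], strongly stretched tail
```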

The designer of a dissimilarity measure often focuses on just the ranking of the obtained dissimilarities for the objects under study. He aims at having, for as many object pairs as possible, the lowest dissimilarity of an object with what he considers the most similar object. For this target the measure is invariant under monotonic transformations. For many of the procedures we will consider (except the nearest neighbor rules), however, this invariance does not hold. Consequently, such procedures may be optimized by applying monotonic transformations to the given dissimilarities, e.g.:

D = log(D);
D = sigm(D);
D = exp(D);
D = D.^2;

Given dissimilarities

The base procedure for any study on the dissimilarity representation is the nearest neighbor rule on the given dissimilarities. This assumes that the labels of the representation set are defined as shown above. The performance of the NN classifier on the given dissimilarities can be obtained by testkd:

D = protein;  % make sure prdisdata is available and in the path
testkd(D,20)  % 20-NN classification
testkd(D)     % 1-NN classification

The error of the 1-NN classification is 0 because in the given dissimilarity dataset the representation set is equal to the total set of objects, which is the case for most publicly available dissimilarity matrices. Moreover, the dissimilarity measure used is such that the dissimilarity between identical objects is 0 and > 0 between non-identical objects. This can be verified by inspecting the matrix:

+D(1:8,1:8)

The leave-one-out (LOO) approach should be used to find a good estimate of the 1-NN performance. As it is very often desired to find the LOO 1-NN error for a square dissimilarity matrix, a special routine, nne, is available to obtain it.

nne(D)  % LOO 1-NN classification
D*nne   % which is the same

If the rows of a dissimilarity matrix are interpreted as vectors representing the corresponding objects, the vector space constituted by the dissimilarities to the representation set is called the dissimilarity space. Its dimension equals the size of the representation set. Any classifier designed for a feature space can be applied here. The handling of training and test sets, especially in cross-validation, needs some care: if it was decided that the representation set equals the training set, then the representation set has to change when the training set changes. The standard procedure for cross-validation in PRTools, e.g.

crossval(D,svc,5,2)

splits for each fold the total dataset D into a trainset and a testset in the same, constant feature space determined by the total dataset. This is usually not what the user wants, as it does not give an error prediction for an arbitrary new incoming object. A special DisTools version of cross-validation, crossvald, handles this:

crossvald(D,svc,5,R,2)

constructs for R = [] representation sets that are identical to the trainset of each fold. The parameter R has additional options; if it is set, a random representation set of the desired size is generated. This behavior is based on a special DisTools version of the PRTools data splitting command gendat, called genddat, e.g.

[T,S] = genddat(D,0.5)

splits D into two dissimilarity datasets T and S, both based on the same repset, the objects of T. These routines have various other options by which repsets different from T can be generated. It is always assumed that D is a full, square dissimilarity matrix.
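The resulting split can be used directly to train and test a classifier in dissimilarity space. A minimal sketch, assuming the protein dataset of prdisdata is in the path:

```matlab
% Hedged sketch: after the split the repset equals the trainset.
D = protein;              % square dissimilarity dataset
[T,S] = genddat(D,0.7);   % 70% of the objects for training
size(T)                   % [n_train, n_train]: trainset vs its own repset
size(S)                   % [n_test,  n_train]: testset vs the same repset
W = T*fisherc;            % train any feature-space classifier in disspace
S*W*testc                 % classification error on the test objects
```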

The classifier evaluation routine cleval has also been adapted for the dissimilarity space. Its DisTools counterpart, clevald, has several options as well, but by default the repset equals the training set.

As it is desirable to compare the standard PRTools classifiers like svc and knnc in dissimilarity space with the k-NN classifier on the given dissimilarities, a special routine called knndc has been constructed. It looks like a classifier in dissimilarity space, but operates on the given dissimilarities only. It optimizes k if it is not set in the call; otherwise it does nothing during training and uses the given dissimilarities during testing: the dissimilarities between a testset and the representation set.
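A small comparison might look as follows. This is a sketch, not a definitive recipe; it assumes the protein dataset is available and that the untrained classifiers are combined with a trainset in the usual PRTools way:

```matlab
% Hedged sketch: k-NN on given dissimilarities vs classifiers in disspace.
D = protein;
[T,S] = genddat(D,0.5);   % repset = trainset
W1 = T*knndc;             % k-NN on given dissimilarities, k optimized
W2 = T*knnc;              % k-NN in dissimilarity space
W3 = T*fisherc;           % linear classifier in dissimilarity space
[S*W1*testc, S*W2*testc, S*W3*testc]
```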

Exercises

  1. Compare for the protein dataset the nearest neighbor classifier on the given dissimilarities, knndc, with classifiers in dissimilarity space. Take knnc, svc and fisherc.
  2. Compute by clevald learning curves of the same classifiers for the prodom dataset, for training set sizes up to 500 per class, with the repset equal to the trainset.
  3. Compute by clevald learning curves of the same classifiers for fixed sizes of an external, randomly selected repset of e.g. 5, 10, 20 and 50 objects.

Dissimilarities from feature space

If not a dissimilarity matrix but a standard feature representation is given, the dissimilarity approach can still be used; it just has to be preceded by the computation of a dissimilarity matrix from the feature representation. See the examples on this topic. This adds all dissimilarity based classifiers to the standard set of PRTools classifiers, e.g.

W = proxm('d',1);             % untrained Euclidean distance mapping
U = {knndc,knnc,svc,fisherc}; % classifiers to be compared
A = gendatb([200 200]);       % two banana-shaped classes
V = A*W;                      % distance mapping trained on A
e = clevald(A*V,U,[2,5,10,20,50],[],5);
plote(e)

computes a set of learning curves for these classifiers. The standard PRTools classifier fdsc is in fact a classifier operating in dissimilarity space. It can be visualized in a 2D scatterplot:

A = gendatb;
scatterd(A);
plotc(A*fdsc);

It can be observed that fdsc adapts itself to the non-linearity of the data without the need to specify a particular non-linearity parameter.

Exercise

Study the exp_spiral example of DisTools. Try to beat fdsc.
