DisTools User Guide, 2 Hour Course, 4 Day Course
Computing Dissimilarities, Manipulation, Visualization
Dissimilarity Matrix Classification, Dissimilarity Space, PE-Embedding, Evaluation

Dissimilarity space

This page belongs to the User Guide of the DisTools Matlab package. It describes some of its commands. Links to other pages are listed above. More information can be found in the pages of the PRTools User Guide. Links are given at the bottom of this page.

The dissimilarity space is postulated as a Euclidean vector space. Objects are represented in this space as vectors of given dissimilarities to a representation set of other (or the same) objects. It can be treated in the same way as a traditional feature space, with the dissimilarities to the representation set taking the role of the features.
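As a minimal sketch of this representation, the following plain MATLAB code (no DisTools or PRTools calls) builds the dissimilarity vectors of five objects with respect to a representation set of two objects. The data values, the variable names X, R and D, and the choice of the Euclidean distance are assumptions made for illustration only.

```matlab
% Five 2-D objects and a representation set of two objects (made-up values).
X = [0 0; 1 0; 0 1; 3 4; 6 8];
R = [0 0; 3 4];

% Squared Euclidean distances via ||x||^2 + ||r||^2 - 2*x*r'.
sq = bsxfun(@plus, sum(X.^2, 2), sum(R.^2, 2)') - 2 * (X * R');

% Clip tiny negative round-off values before taking the square root.
D = sqrt(max(sq, 0));

% Each row of D is one object in the dissimilarity space.
% The dimensionality equals the size of the representation set.
disp(size(D));
```

Each object is now a row vector with one component per representation object, so the dimensionality of the space is determined by the representation set and not by the original description of the objects.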

Users have to decide whether they want to use a separate representation set, different from the training and test sets. Alternatively, they can use (a part of) the training set for representation and may even include test objects, as the representation objects need not be labeled. The main routine to organize this is genddat. Here are a few examples based on a square dissimilarity matrix D of size 100 by 100 with two classes of 40 and 60 objects, respectively, and class priors 0.4 and 0.6. In each example a dissimilarity trainset DT and a testset DS with the same repset are generated.

% 60% for training, repset is trainset
[DT,DS] = genddat(D,0.6)

% DT is a 60 by 60 dataset with 2 classes: [24  36]
% DS is a 40 by 60 dataset with 2 classes: [16  24]

% 60% for training, 10% of the trainset for the repset
[DT,DS] = genddat(D,0.6,0.1)

% DT is a 60 by 7 dataset with 2 classes: [24  36]
% DS is a 40 by 7 dataset with 2 classes: [16  24]
% Note that fractions in numbers of objects are rounded up.

% 60% for training, 0% of the trainset, 10% of the testset for the repset
[DT,DS] = genddat(D,0.6,0,0.1)
% DT is a 60 by 5 dataset with 2 classes: [24  36]
% DS is a 40 by 5 dataset with 2 classes: [16  24]

% similar, but exclude repset from testset
[DT,DS] = genddat(D,0.6,0,0.1,'ex')
% DT is a 60 by 5 dataset with 2 classes: [24  36]
% DS is a 35 by 5 dataset with 2 classes: [14  21]
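The sizes reported in these examples can be reproduced by simple arithmetic if the fractions are applied per class and rounded up. That the rounding is done per class is an inference from the reported numbers, not something taken from the genddat source. The sketch below uses plain MATLAB and hypothetical variable names.

```matlab
% Class sizes after the 60% split, as reported in the examples above.
trainSizes = [24 36];
testSizes  = [16 24];

% 10% of the trainset for the repset, rounded up per class (assumed).
repFromTrain = ceil(0.1 * trainSizes);      % 2.4 -> 3, 3.6 -> 4
nRepTrain    = sum(repFromTrain);           % 7 columns in DT and DS

% 10% of the testset for the repset, rounded up per class (assumed).
repFromTest = ceil(0.1 * testSizes);        % 1.6 -> 2, 2.4 -> 3
nRepTest    = sum(repFromTest);             % 5 columns in DT and DS

% With 'ex' the repset objects are removed from the testset.
testSizesEx = testSizes - repFromTest;      % [14 21], 35 objects in total
```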

Once a training set is available as a dissimilarity matrix stored in a dataset DT, the dissimilarity space approach interprets its rows as a set of training vectors. In principle, any classifier suitable for a traditional feature space can be used here as well. An independent test set represented in the same space by DS can be used for evaluation.
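To illustrate what treating dissimilarity rows as training vectors means, here is a toolbox-free sketch that applies a nearest-mean rule in the dissimilarity space. The toy data, the variable names and the choice of the nearest-mean rule are assumptions for illustration; in practice one would train a PRTools classifier on DT instead.

```matlab
% Toy 1-D objects in two classes (made-up values).
xtrain = [0.0; 0.2; 0.4; 3.0; 3.2; 3.4];
ytrain = [1; 1; 1; 2; 2; 2];
xtest  = [0.1; 0.5; 2.9; 3.6];
ytest  = [1; 1; 2; 2];

% Representation set: here simply the training set itself.
xrep = xtrain;

% Dissimilarity representations: absolute differences to the repset.
DT = abs(bsxfun(@minus, xtrain, xrep'));    % 6 by 6
DS = abs(bsxfun(@minus, xtest,  xrep'));    % 4 by 6

% Class means of the training vectors in the dissimilarity space.
m1 = mean(DT(ytrain == 1, :), 1);
m2 = mean(DT(ytrain == 2, :), 1);

% Squared Euclidean distance of every test vector to both class means.
d1 = sum(bsxfun(@minus, DS, m1).^2, 2);
d2 = sum(bsxfun(@minus, DS, m2).^2, 2);

% Assign each test object to the class with the nearest mean.
pred = 1 + (d2 < d1);
err  = mean(pred ~= ytest);
```

Note that the classifier never uses the fact that the columns are dissimilarities: they are handled as ordinary features.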

The strength of this approach is that it can be used for any dissimilarity measure: Euclidean or non-Euclidean, metric or non-metric, symmetric or non-symmetric. A drawback of this generality is that the characteristic of the dissimilarity values themselves (positive numbers, with smaller values indicating more similar objects) is not used. A second drawback is that the dimensionality can be large, e.g. as large as the training set or even larger, possibly with highly correlated dimensions as well. This should be taken into account when training classifiers.
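The remark on high correlations can be made concrete with a small plain MATLAB sketch: two representation objects that lie close to each other yield almost identical columns in the dissimilarity matrix. The data values and variable names are assumptions chosen for illustration.

```matlab
% Ten 1-D objects and three representation objects (made-up values).
% The first two representation objects are nearly identical.
x = (0:9)';
r = [2 2.1 7];

% Dissimilarity matrix: one column per representation object.
D = abs(bsxfun(@minus, x, r));              % 10 by 3

% Correlation coefficients between the columns of D.
c = corrcoef(D);

% c(1,2) is above 0.99 for these values: the two columns carry
% almost the same information.
disp(c(1,2));
```

Such near-duplicate dimensions are one reason to reduce the representation set before training.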

A solution for both drawbacks is to reduce the representation set significantly. This can be done by a random selection, as shown above with the genddat function. In addition, there are various approaches to do this systematically, e.g. the feature selection routines of PRTools. These do not use the characteristics of dissimilarities but are based on the class separability in the dissimilarity space.

A faster solution is to use the object properties offered by the individual dissimilarities. The DisTools routine protselfd offers some supervised and some unsupervised possibilities.

Here are some examples. They all compute a selection mapping W that may be applied by [D1,D2] = D*W, in which D1 contains the selected prototypes for the representation set and D2 contains the other ones.

% Individual selection using the 1NN LOO criterion
W = D*featseli;

% Forward selection of K prototypes based on the Mahalanobis distance
W = D*featself([],'maha-s',K);

% List the possible criteria for feature selection
help feateval

% Rank the prototypes (see help protselfd for criteria)
W = D*protselfd([],CRIT);
% Use the first 10
D2 = D*W(:,1:10);

% Split the dataset into a training set and a test set
[DT,DS] = genddat(D,0.5);
% Compute a selection of 10 prototypes on the training set
W = DT*protselfd([],CRIT,10);
% Apply it to the trainset, compute the complementary set as well
[DT1,DT2] = DT*W;
% Apply it to the testset, compute the complementary set as well
[DS1,DS2] = DS*W;
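The effect of such a selection can be sketched in plain MATLAB as a column selection that is computed on the training set and then applied identically to the test set. The selection rule used here is a farthest-first traversal, chosen only because it is short and unsupervised; it is not claimed to be one of the criteria implemented in protselfd. Data and variable names are assumptions.

```matlab
% Made-up 1-D objects; the repset is initially the whole training set.
xtrain = [0; 0.1; 0.2; 5; 5.1; 9];
xtest  = [0.05; 5.2; 8.8];
DT = abs(bsxfun(@minus, xtrain, xtrain'));  % 6 by 6, square
DS = abs(bsxfun(@minus, xtest,  xtrain'));  % 3 by 6

% Farthest-first traversal: select k prototypes using DT only.
k = 3;
sel = zeros(1, k);
sel(1) = 1;                        % arbitrary start: the first object
mind = DT(:, 1);                   % distance to the nearest selected prototype
for i = 2:k
    [~, sel(i)] = max(mind);       % object farthest from all selected ones
    mind = min(mind, DT(:, sel(i)));
end
rest = setdiff(1:size(DT, 2), sel);

% Apply the same column selection to trainset and testset,
% keeping the complementary columns as well.
DT1 = DT(:, sel);  DT2 = DT(:, rest);
DS1 = DS(:, sel);  DS2 = DS(:, rest);
```

The important point is that the selection is determined on the training set alone and that the test set is reduced to exactly the same columns.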

Once a proper repset has been selected and applied to both the trainset and the testset, any PRTools classifier can be trained and then evaluated using the test routines testc and testd.

PRTools User Guide
elements: datasets, datafiles, cells and doubles, mappings, classifiers, mapping types
operations: datasets, datafiles, cells and doubles, mappings, classifiers, stacked, parallel, sequential, dyadic
commands: datasets, representation, classifiers, evaluation, clustering and regression, examples, support
