clustk

CLUSTK

Feature space centroid-based clustering with K prototypes:

kmeans, kcentres and kmedoids

LAB = CLUSTK(A,K,TYPE,R,MSIZE)
LAB = A*CLUSTK(K,TYPE,INIT,R,MSIZE)

Input
A Feature based dataset with M objects, possibly doubles.
K Vector with desired numbers of clusters, default sampling of [2:M]
TYPE 'kmeans' (default), 'kcentres' or 'kmedoids'.
INIT Vector of length max(K), indices of initial centres or medoids.
INIT = []: default: systematic initialisation by CLUSTF.
R Number of clustering trials based on random initialisations. The best cluster result is returned.
MSIZE Number of objects (M) above which the dataset is preclustered by
CLUSTM, reducing it to MSIZE objects. Default MSIZE = 5000. Use
MSIZE = inf to avoid preclustering.

Output
LAB M*NUMEL(K) array with the results of the multilevel clustering for the M objects. Each column holds one clustering: for every object it gives the index of the prototype of the cluster the object belongs to.

Description

An initial set of K prototypes is iteratively optimised such that the set of objects with the same nearest prototype (a 'cluster') has this prototype exactly as its mean, medoid or centre. The medoid is defined as the object in a cluster for which the mean distance to all other objects is minimum. The centre is defined as the object in a cluster for which the maximum distance to all other objects is minimum.
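The three prototype definitions can be illustrated with a minimal sketch in base MATLAB. It is not the CLUSTK implementation; the toy matrix X and all variable names are assumptions made for this illustration only.

```matlab
% Sketch only: mean, medoid and centre of ONE cluster.
% X is a hypothetical N-by-D matrix with the objects of a single cluster.
X = [0; 1; 2; 3; 10];

% Euclidean distances between all pairs of objects (N-by-N)
sq = sum(X.^2, 2);
D  = sqrt(max(bsxfun(@plus, sq, sq') - 2*(X*X'), 0));

mu = mean(X, 1);                   % cluster mean (usually not an object)

% Medoid: object with the smallest summed (hence mean) distance to the others
[~, medoid] = min(sum(D, 2));

% Centre: object with the smallest maximum distance to the others
[~, centre] = min(max(D, [], 2));

fprintf('mean = %.1f, medoid = object %d, centre = object %d\n', ...
        mu, medoid, centre);
```

For this toy cluster the mean is 3.2, the medoid is object 3 (value 2) and the centre is object 4 (value 3): the outlier at 10 pulls the centre further than the medoid.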

The clustering is iterated until stability or is stopped prematurely by PRTIME. With random initialisations (R is an integer > 0) the clustering is repeated R times and the best result is returned. For R = 0 (special case) a systematic initialisation is performed and the resulting clustering is returned directly, without optimisation.
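The iteration and the R random trials can be sketched as below for the kmeans case. This is a plain Lloyd-style loop written for illustration, not the code used by CLUSTK. The toy data, the iteration cap maxIter and in particular the criterion for the "best" result (here the within-cluster sum of squared distances) are assumptions; the criterion actually used by CLUSTK is not stated here.

```matlab
% Sketch only: k-means with R random restarts on hypothetical toy data.
X = [0 0; 0 1; 1 0; 1 1; 10 10; 10 11; 11 10; 11 11];
K = 2;  R = 5;  maxIter = 100;     % maxIter is an assumed safety cap
M = size(X, 1);

% Squared Euclidean distances between objects A and prototypes C
sqdist = @(A, C) bsxfun(@plus, sum(A.^2, 2), sum(C.^2, 2)') - 2*(A*C');

bestErr = inf;  bestLab = [];
for r = 1:R
    p   = randperm(M);
    C   = X(p(1:K), :);            % random initial prototypes
    lab = zeros(M, 1);
    for it = 1:maxIter
        [~, newLab] = min(sqdist(X, C), [], 2);   % nearest prototype
        if isequal(newLab, lab), break; end       % stable: stop iterating
        lab = newLab;
        for k = 1:K                % move each prototype to its cluster mean
            if any(lab == k), C(k, :) = mean(X(lab == k, :), 1); end
        end
    end
    err = sum(min(sqdist(X, C), [], 2));   % assumed quality criterion
    if err < bestErr, bestErr = err; bestLab = lab; end
end
```

Each trial stops as soon as the assignment of objects to prototypes no longer changes, which corresponds to the "stability" mentioned above.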

LAB is a column vector of length M or an array with length(K) columns. It contains, for every object and for every clustering, the cluster indices. In case of kcentres or kmedoids they point to the objects that are found as the centres or medoids. In case of kmeans they point to the objects nearest to the cluster means.
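Because the labels are object indices rather than consecutive cluster numbers, it can be convenient to renumber them. The sketch below does this with the standard UNIQUE and ACCUMARRAY functions; the label column shown is a made-up example, not actual CLUSTK output.

```matlab
% Sketch only: convert prototype indices into cluster numbers 1..k.
% lab is a hypothetical single column of LAB for 6 objects.
lab = [4; 4; 9; 9; 4; 17];

% protos: sorted indices of the prototype objects
% idx   : for every object its cluster number in 1..numel(protos)
[protos, ~, idx] = unique(lab);

sizes = accumarray(idx, 1);        % number of objects per cluster
```

Here protos is [4; 9; 17], idx is [1; 1; 2; 2; 1; 3] and sizes is [3; 2; 1].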

If K is given, its values are reduced to less than M/5 to keep the routine feasible. Moreover, if M > MSIZE the dataset A is preclustered by PRECLUST using CLUSTM. Unless specific values of K < 100 are needed, it is recommended for fast processing to use K = []. Speed may be further increased by using smaller values of MSIZE, e.g. MSIZE = 500.

Example(s)

randreset;                     % take care of reproducibility
data = gendatclust1(20000);    % generate 20000 objects in 10 clusters
                                % Run k-means clustering for 7 values of K
lab = clustk(data,[2 5 10 18 30 50 100],'kmeans',[],2000);
                                % Show scatterplot for 10 clusters
figure; scatn(lab(:,3),data,'K-Means');
figure; clusteval(lab,data);   % Evaluation by active learning
