Verzakov Tree - Trainable Decision Tree Classifier
W = DTC(A, CRIT, CHISQSTOPVAL, PRUNE, T, NFEAT)
Object weights and class priors:
| A|| Training dataset.|
It is possible to assign individual weights to dataset
objects by using the 'weights' identifier:
A = SETIDENT(A, WEIGHTS, 'weights')
If weights are not defined, they are assumed to be equal to 1
for all objects. The actual weights used by the training routine
are computed from the supplied weights together with the class
priors (if priors are not set, the apparent priors are used):
ACTUAL_WEIGHTS(I) = M*(WEIGHTS(I)/CW(C))*(PRIOR(C)/SUM(PRIOR))
where M is the total number of (labelled) objects in A,
CW is the vector of per-class sums of the supplied weights, and
C is the class of object I. The sum of all actual weights is M.
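For illustration only, the actual weights described above could be
computed in plain MATLAB roughly as follows (a minimal sketch, not
part of the routine; LAB, W and PRIOR are assumed to hold the class
indices, supplied weights and class priors):
   lab   = lab(:); w = w(:); prior = prior(:);
   M     = numel(lab);                 % total number of labelled objects
   cw    = accumarray(lab, w);         % per-class sums of the supplied weights (CW)
   aw    = M * (w ./ cw(lab)) .* (prior(lab) / sum(prior));
   % sum(aw) equals M, as stated above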
Features are treated differently based on their domain information.
If the feature domain is empty or is a collection of intervals,
the feature is considered to be continuous and branches are
created by splitting at a threshold value.
Otherwise (the feature domain specifies a set of values or a set
of names), the feature is considered to be nominal and branches
corresponding to all feature values present are created.
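A hedged usage sketch (the exact domain encoding passed to
SETFEATDOM below is an assumption, not prescribed by this routine):
   % feature 1: continuous on the interval [0 10];
   % feature 2: nominal with the values 1, 2 and 3
   A = setfeatdom(A, {[0 10], [1 2 3]'});
   W = dtc(A);   % feature 1 is split on a threshold,
                 % feature 2 gets one branch per present value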
Unknown and 'non-applicable' values:
If a feature value is NaN, it is assumed to be unknown. In that
case the object with the unknown value is split into fractions
and sent down all branches.
The concept of a 'non-applicable' feature value is different
from the concept of an unknown (missing) value.
'Non-applicable' values of continuous features are encoded
as INF. If the feature domain is a (set of) interval(s),
then the INF value has to be explicitly added to the domain
definition. The 'non-applicable' value of nominal features
does not have a predefined encoding. If it is needed, the user
has to include such a value in the domain definition.
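A small sketch of the encodings described above (illustrative
values only):
   % NaN marks an unknown value, INF a 'non-applicable' value
   % of a continuous feature
   x = [2.7  NaN  Inf];
   % if a domain is defined for the third feature, INF must be
   % listed in that domain explicitly (see the discussion above)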
CRIT Splitting criterion name.
'igr' Information Gain Ratio (default):
As defined by Quinlan. A penalty on the number of distinct
values of a continuous feature is used. If the gain is zero or
negative due to such penalisation, the split is not performed.
This leads to smaller trees and may give a non-zero training error.
This criterion does not use costs. (Costs are used only at the
classification stage.)
'gini' Gini impurity index. More precisely, the change in this
index. The Gini index can be interpreted as the misclassification
rate of a stochastic classifier based on the priors, so costs are
naturally embedded. If the change in the (absolute) error is less
than or equal to 0.1 (or, when costs are defined, the change in
the cost is less than or equal to 0.1 of the minimal absolute
value of the non-zero costs), the split is not performed.
This leads to smaller trees and may give a non-zero training error.
'miscls' Classification error criterion.
To be used only for educational purposes because
it gives rather inferior results. Costs are naturally embedded.
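For example, the criterion is selected as the second argument of
DTC (a hedged usage sketch following the call signature above):
   W1 = dtc(A);            % default: information gain ratio
   W2 = dtc(A, 'gini');    % Gini impurity criterion
   e  = testc(A*W2);       % apparent error; may be non-zero, as noted above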
CHISQSTOPVAL Specificity = (1 - significance level) of the
chi-squared test on the difference between the original node class
distribution and the branch class distributions. The null
hypothesis is that the distribution does not change after the
split (in other words: class labels and splitting are independent).
If the null hypothesis is accepted, splitting is stopped and the
current node becomes a leaf. CHISQSTOPVAL should be in the [0, 1]
range. E.g., CHISQSTOPVAL = 0.95 means that the significance level
is 5% and there is only a 5% chance of incorrectly rejecting the
null hypothesis that there is no change in the class distribution
after the split. (optional; default: 0, which means that branches
are discarded only if they bring no change in the class
distribution)
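For example (a hedged sketch following the call signature above):
   % stop splitting a node when a chi-squared test at the 5%
   % significance level cannot reject independence of the class
   % labels and the split
   W = dtc(A, 'igr', 0.95);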
PRUNE Pruning type name
'prunep' - pessimistic (top-down) pruning as defined by Quinlan.
Using the 'prunep' option while training a tree on a dataset with
a defined cost matrix will cause an error.
'prunet' - test set (bottom-up) pruning using the (required)
test dataset T.
The implementations of both pruning algorithms may not be
exactly correct if there are unknown values in the datasets.
(optional; default: '' -- no pruning)
T Test dataset for the test set pruning ('prunet')
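A hedged pruning sketch (GENDAT, a standard PRTools routine, is
used here only to obtain a separate test set):
   [A, T] = gendat(A, 0.5);             % split the data into a training and a test part
   W1 = dtc(A, 'igr', 0, 'prunep');     % pessimistic pruning
   W2 = dtc(A, 'igr', 0, 'prunet', T);  % test set pruning on T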
NFEAT The number of features to be randomly selected to find a split
at each node; to be used to create a random forest.
If NFEAT ==  then all available features will be used.
(optional; default: )
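A hedged sketch of a simple random-forest-style ensemble built on
top of DTC (the resampling scheme and the number of trees are
illustrative choices, not part of the routine):
   ntrees = 10;
   trees  = cell(1, ntrees);
   for i = 1:ntrees
     B        = gendat(A, 0.7);                % random subsample (not a true bootstrap)
     trees{i} = dtc(B, 'gini', 0, '', [], 3);  % consider 3 randomly chosen features per node
   end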
| W|| Classifier mapping|
If (fully grown) branches of a (sub)tree do not improve the classification error (misclassification cost), they are immediately discarded. This may happen because regularised posteriors are used. As a result the algorithm is more stable and the trees are smaller, but splits on the training set may not be perfect.
J.R. Quinlan, Simplifying Decision Trees, International Journal of Man-Machine Studies, 27(3), pp. 221-234, 1987.
J.R. Quinlan, Improved Use of Continuous Attributes in C4.5, Journal of Artificial Intelligence Research, 4, pp. 77-90, 1996.