In the Analyze phase of a DMAIC (Define, Measure, Analyze, Improve, Control) Six Sigma project, potential root causes of variations and defects are identified and validated. Various data analysis tools are used for exploratory and confirmatory studies. Descriptive and graphical techniques help with understanding the nature of data and visualizing potential relationships. Statistical analysis techniques, such as hypothesis testing and regression, are used to validate the root causes.

Regression analysis is one of the most widely used statistical methods in the Analyze phase, but some situations warrant a nonparametric method instead. The most common are violations of the basic assumption of normally and independently distributed residuals and the presence of nonlinear relationships; in these cases a nonparametric method such as a classification and regression tree (CART) is more appropriate. CART can also be appropriate in service industries such as banking and healthcare, where many potential causes of variation and defects are categorical in nature (e.g., geographical locations, products, channels, partners). The problem with using regression or generalized linear models (GLM) in such cases is that the large number of dummy variables makes the results difficult to interpret. CART is a nonparametric technique that explains a continuous or categorical dependent variable in terms of multiple independent variables, which can themselves be continuous or categorical. CART employs a partitioning approach generally known as “divide and conquer.”

How CART Works

Assume there is a set of credit card transactions labeled as fraudulent or authentic. There are two attributes of each transaction: amount (of transaction) and age of customer. Figure 1 displays an example map of fraudulent and authentic transactions.

At each step, the CART algorithm looks for the independent variable, and the split on that variable, that divides the data into the most homogeneous groups. For a classification problem, where the response variable is categorical, the split is chosen by calculating the information gain, that is, the reduction in entropy that results from the split. For a numeric response, homogeneity is measured by statistics such as standard deviation or variance. (For more information on this, please refer to Machine Learning with R by Brett Lantz.)
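The information gain calculation for the classification case can be sketched in a few lines of Python. This is a minimal illustration, not a full CART implementation; the function names and the toy labels are assumptions made for the example and do not come from the figures.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) \
             + (len(right) / total) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node: 4 fraudulent and 4 authentic transactions.
parent = ["fraud"] * 4 + ["auth"] * 4

# A candidate split that separates the two classes completely ...
perfect_gain = information_gain(parent, ["fraud"] * 4, ["auth"] * 4)

# ... and one that leaves both children exactly as mixed as the parent.
mixed = ["fraud", "fraud", "auth", "auth"]
useless_gain = information_gain(parent, mixed, mixed)

print(perfect_gain, useless_gain)
```

The split with the larger gain produces the more homogeneous groups, so it is the one the algorithm would prefer.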

Two important parameters of the CART technique are the minimum split criterion and the complexity parameter (Cp). The minimum split criterion is the minimum number of records that must be present in a node before a split can be attempted; it has to be specified at the outset. Cp prevents the splitting of nodes where the split is obviously not worthwhile. Put another way, the minimum split criterion is fixed before the tree is grown, whereas the optimal Cp value is determined after “growing the tree” and is then used to “prune the tree.”
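The two parameters can be read as two simple checks on a candidate split. The sketch below assumes the convention used by R's rpart package, in which Cp is expressed relative to the error of the root node; the function names and the error counts are hypothetical.

```python
def split_allowed(n_records, min_split):
    """A node may be split only if it holds at least min_split records."""
    return n_records >= min_split

def split_worthwhile(root_error, parent_error, children_error, cp):
    """Keep a split only if it lowers the error by at least cp,
    measured as a fraction of the root node's error (assumed convention)."""
    relative_improvement = (parent_error - children_error) / root_error
    return relative_improvement >= cp

# Hypothetical misclassification counts.
root_error = 100      # errors if the whole data set is one node
parent_error = 40     # errors in the tree before the candidate split
cp = 0.01

print(split_allowed(n_records=15, min_split=20))                 # node too small
print(split_worthwhile(root_error, parent_error, 38.0, cp))      # improvement 0.02
print(split_worthwhile(root_error, parent_error, 39.5, cp))      # improvement 0.005
```

A larger Cp demands a bigger improvement from every split and therefore yields a smaller tree.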

In this example, Figure 2 shows that the first rule formed is x2 > 35 → fraudulent transaction. Similarly, other rules are formed as shown in Figures 3 and 4.

In this way, the CART algorithm keeps dividing the data set until no node holds enough records to be split further under the minimum split criterion. This results in a tree-like structure as shown in Figure 5. The Cp value is then plotted against various levels of the tree and the optimum value is used to prune the tree.
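The “divide and conquer” growth of the tree can be sketched as a short recursive function. This is a simplified illustration under stated assumptions: it uses entropy-based information gain, handles numeric attributes only, and does no pruning. The eight (amount, age) records are invented for the example and are not the data behind Figures 1 through 5.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, feature_index, threshold) for the best binary split, or None."""
    base, best = entropy(labels), None
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            weighted = (len(left) * entropy(left)
                        + len(right) * entropy(right)) / len(labels)
            gain = base - weighted
            if best is None or gain > best[0]:
                best = (gain, f, threshold)
    return best

def grow(rows, labels, min_split):
    """Recursively partition the data; a leaf is the majority class label."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or too small to attempt a split.
    if len(set(labels)) == 1 or len(rows) < min_split:
        return majority
    split = best_split(rows, labels)
    if split is None or split[0] <= 0:
        return majority
    _, f, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": grow([r for r, _ in left], [l for _, l in left], min_split),
        "right": grow([r for r, _ in right], [l for _, l in right], min_split),
    }

def predict(tree, row):
    """Follow the rules from the root down to a leaf."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical transactions: (amount, age of customer).
rows = [(20, 25), (35, 30), (50, 45), (80, 50),
        (90, 28), (120, 22), (150, 60), (200, 40)]
labels = ["authentic", "authentic", "fraudulent", "fraudulent",
          "authentic", "fraudulent", "fraudulent", "fraudulent"]

full_tree = grow(rows, labels, min_split=2)
small_tree = grow(rows, labels, min_split=5)
print(full_tree)
print(small_tree)
```

Raising the minimum split criterion from 2 to 5 stops the second split from being attempted, which shows how that parameter limits the depth of the tree before any pruning takes place.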

Application of CART

The following example contains a hypothetical dataset of 600 dispatch transactions of a bank.

The dependent variable is the attribute “defective,” which is a categorical variable with two classes (yes and no). Each transaction is labeled either “yes” or “no” based on whether there is any printing error in the deliverable. The independent variables are “amount,” “channel,” “service type,” “customer category” and “department involved.” The first step in applying any analytical method is to explore the data using descriptive statistics. Assume that in exploring the data all of the independent variables seem to have a significant relationship with the dependent variable.

In order to carry out the CART analysis, the dataset is randomly split into two sets, the training set and the testing set. Because nonparametric studies are not based upon theoretical probability distributions, it is widely accepted practice to build a model on one set of data and test it on another. This helps in ascertaining the accuracy of the model on unknown future records.
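A random split of the 600 records into training and testing sets might look like the following sketch. The 70/30 ratio and the fixed seed are assumptions chosen for illustration; the text does not specify the proportions used.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and cut it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 600 dispatch transactions: one ID per record.
transaction_ids = list(range(600))
train, test = train_test_split(transaction_ids)
print(len(train), len(test))
```

Every record lands in exactly one of the two sets, so the test set contains only transactions the model has never seen.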

The CART model is used to find the relationship between defective transactions and “amount,” “channel,” “service type,” “customer category” and “department involved.” After building the model, the Cp value is checked across the levels of the tree to find the optimum level, at which the relative error is at its minimum. The optimum Cp value is then used to prune the tree.
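Selecting the optimum Cp amounts to reading off the row with the smallest error from a table of tree sizes. The sketch below assumes the error in question is a cross-validated relative error, as reported by tools such as R's rpart; the table values are invented purely to show the lookup.

```python
# Hypothetical Cp table: (cp, number_of_splits, relative_error).
cp_table = [
    (0.200, 0, 1.00),
    (0.050, 1, 0.82),
    (0.020, 3, 0.71),
    (0.010, 5, 0.74),
    (0.001, 9, 0.80),
]

# Pick the row whose relative error is lowest.
best_cp, best_splits, best_error = min(cp_table, key=lambda row: row[2])
print(best_cp, best_splits, best_error)
```

In this made-up table the error falls and then rises again as the tree grows, so the tree would be pruned back to the size associated with the selected Cp.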

After pruning, the “final” tree can be created as shown in Figure 8. The model can also be validated against the test data to ascertain its accuracy.
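Validation against the test data usually comes down to comparing predicted and actual labels. The following sketch computes overall accuracy and the four confusion-matrix counts for the “defective” variable; the ten actual and predicted labels are hypothetical and stand in for the output of a fitted model.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a two-class response."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["true_positive" if a == positive else "false_positive"] += 1
        else:
            counts["false_negative" if a == positive else "true_negative"] += 1
    return counts

def accuracy(actual, predicted):
    """Share of records whose predicted label matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical test-set labels for "defective".
actual    = ["yes", "no", "no", "yes", "no", "no",  "yes", "no", "no", "no"]
predicted = ["yes", "no", "no", "no",  "no", "yes", "yes", "no", "no", "no"]

counts = confusion_counts(actual, predicted)
print(dict(counts), accuracy(actual, predicted))
```

Looking at the individual counts as well as the overall accuracy matters when defects are rare, because a model can score well simply by predicting “no” for every transaction.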