# Taking Advantage of the Age of Statistics: Part 2

By Peter J. Sherman

Part 1 of this article helped practitioners understand the key drivers for the growth of statistics, introduced some leading analytics competitors and outlined the high-level profit roadmap. Now it is time to begin using the tools. Part 2 describes some statistical tools practitioners can use for predictive decision making.

### Applying the Analytics Tools

Two key statistical tools commonly used to provide valuable business insights are:

1. Analysis of variance (ANOVA), which tests whether or not groups of data have the same or differing means.

2. Regression analysis, which measures the statistical relationship or strength between a response variable and one or more predictor variables. There are two basic types of regression analysis: simple linear (for one predictor variable) and multiple linear (for more than one predictor variable).

For simple or multiple linear regression, the response variable must be continuous. Predictor variables can be continuous or discrete, but should be independent of one another. The beauty of regression analysis is that it produces a mathematical equation that can be used to build forward-looking models.

### Simple Linear Regression

Consider a generic business in either the manufacturing or service sector. To improve profitability, the business decides to focus on increasing sales – the response variable. The company’s sales and marketing team shares with the practitioner that price is generally a good predictor variable impacting sales. The practitioner collects quarterly sales data for the past three years, along with the average retail unit price for each quarter (Table 1).

Table 1: Sales and Price Data

| Quarter | Sales | Price |
| --- | --- | --- |
| Q107 | \$1,000,000 | \$250 |
| Q207 | \$1,500,000 | \$248 |
| Q307 | \$1,050,000 | \$250 |
| Q407 | \$1,200,000 | \$250 |
| Q108 | \$975,000 | \$248 |
| Q208 | \$1,150,000 | \$250 |
| Q308 | \$1,225,000 | \$250 |
| Q408 | \$1,550,000 | \$247 |
| Q109 | \$1,125,000 | \$250 |
| Q209 | \$1,250,000 | \$250 |
| Q309 | \$1,475,000 | \$248 |
| Q409 | \$1,360,000 | \$250 |

With this data, the practitioner can perform a simple linear regression analysis (Figure 1).

Figure 1: Regression Analysis: Sales versus Price

A. Analysis of Variance (ANOVA)

| Source | DF | SS | MS | F | P |
| --- | --- | --- | --- | --- | --- |
| Regression | 1 | 1.38528E+11 | 1.38528E+11 | 4.91 | 0.051 |
| Residual Error | 10 | 2.81939E+11 | 28193859649 | | |
| Total | 11 | 4.20467E+11 | | | |

B. Regression Analysis

| Predictor | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| Constant | 25813509 | 11086875 | 2.33 | 0.042 |
| Price | -98596 | 44481 | -2.22 | 0.051 |

S = 167910

C. R-Sq = 32.9%

D. The regression equation is Sales = 25813509 – 98596 Price

Interpretation

A. The p-value in the ANOVA table is a probability that is compared with the decision criterion, the alpha (α) risk. Assume α = 0.05, meaning there is a 5 percent risk of rejecting the null hypothesis when it is true. In this example, the null hypothesis (H0) is that price has no effect on sales. The alternate hypothesis (Ha) is that price does have an effect on sales. A p-value below α indicates the factor is significant: if the null were true, a result this extreme would be expected less than 5 percent of the time, which is within the risk the practitioner is willing to take. Conversely, a p-value above α indicates the factor is not significant. The p-value in this analysis (0.051) is slightly above the 5 percent alpha risk, indicating that the model estimated by the regression analysis is not significant at the 0.05 level. The ANOVA table does not provide the regression equation; the regression analysis in section B is needed for that.
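As a rough illustration, the ANOVA quantities for the price model can be recomputed from the Table 1 data. The sketch below uses only the Python standard library; the function name `anova_simple` is hypothetical, and the critical value of about 4.96 for F(1, 10) at α = 0.05 is taken from a standard F table rather than from the article.

```python
# Table 1 data: average unit price and quarterly sales, both in dollars.
price = [250, 248, 250, 250, 248, 250, 250, 247, 250, 250, 248, 250]
sales = [1_000_000, 1_500_000, 1_050_000, 1_200_000, 975_000, 1_150_000,
         1_225_000, 1_550_000, 1_125_000, 1_250_000, 1_475_000, 1_360_000]

# Assumption: upper 5 percent point of F(1, 10) from a standard F table.
F_CRITICAL = 4.96


def anova_simple(x, y):
    """Return the ANOVA quantities for a one-predictor least-squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_regression = sxy ** 2 / sxx        # variation explained by the line
    ss_error = ss_total - ss_regression   # variation left in the residuals
    df_regression, df_error = 1, n - 2
    ms_regression = ss_regression / df_regression
    ms_error = ss_error / df_error
    return {
        "ss_regression": ss_regression,
        "ss_error": ss_error,
        "ss_total": ss_total,
        "df_error": df_error,
        "f": ms_regression / ms_error,
    }


result = anova_simple(price, sales)
print(f"SS regression = {result['ss_regression']:.5e}")
print(f"SS error      = {result['ss_error']:.5e}")
print(f"F             = {result['f']:.2f}")
print("Significant at 0.05:", result["f"] > F_CRITICAL)
```

Comparing F with a tabulated critical value gives the same decision as comparing the p-value with α, without needing a statistics package to evaluate the F distribution.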

B. The regression results indicate the size, direction, and statistical significance of the relationship between a predictor and response.

• Coefficients represent the mean change in the response for one unit of change in the predictor, while holding other predictors in the model constant.
• Sign (plus or minus) of each coefficient indicates the direction of the relationship (positive or negative correlation). For example, price has a negative sign, indicating it is negatively correlated with sales. This makes sense since one would expect higher sales given a lower price and vice versa.
• The standard error of the coefficient (SE Coef) estimates the amount of sampling error in the coefficient.
• The t-statistic (coefficient/SE coefficient) is used to test the significance of the coefficient.
• P-values test the null hypothesis that the coefficient is equal to zero (no effect). Low p-values suggest the predictor is a meaningful addition to the model. High p-values suggest the predictor is not a meaningful addition to the model. Here the p-value is 0.051 (greater than the 0.05 alpha risk), indicating price is not significantly related to sales. A non-significant result does not prove the predictor is useless; it means these data do not provide enough evidence of a relationship at the chosen alpha level. This is a prompt for the practitioner to explore other predictor variables using multiple regression.
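The coefficient, its standard error and the t-statistic described in the list above can be sketched in a few lines of standard-library Python. The function name `slope_test` is hypothetical, and the two-sided critical value of 2.228 (10 degrees of freedom, α = 0.05) comes from a standard t table, not from the article.

```python
import math

# Table 1 data: average unit price and quarterly sales, both in dollars.
price = [250, 248, 250, 250, 248, 250, 250, 247, 250, 250, 248, 250]
sales = [1_000_000, 1_500_000, 1_050_000, 1_200_000, 975_000, 1_150_000,
         1_225_000, 1_550_000, 1_125_000, 1_250_000, 1_475_000, 1_360_000]

# Assumption: two-sided 5 percent critical value of t with 10 degrees of
# freedom, read from a standard t table.
T_CRITICAL = 2.228


def slope_test(x, y):
    """Return (intercept, slope, SE of slope, t-statistic) for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual mean square: squared residuals divided by n - 2.
    ss_error = sum((yi - (intercept + slope * xi)) ** 2
                   for xi, yi in zip(x, y))
    ms_error = ss_error / (n - 2)
    se_slope = math.sqrt(ms_error / sxx)
    return intercept, slope, se_slope, slope / se_slope


intercept, slope, se_slope, t_stat = slope_test(price, sales)
print(f"Coef = {slope:.0f}, SE Coef = {se_slope:.0f}, T = {t_stat:.2f}")
print("Significant at 0.05:", abs(t_stat) > T_CRITICAL)
```

With these data the absolute t-statistic falls just short of the critical value, which matches the p-value of 0.051 sitting just above 0.05.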

C. The coefficient of determination (R²) value indicates that price (predictor variable) explains only 32.9 percent of the variance in sales. This suggests the model does not fit the data well. The practitioner should identify more significant predictor variables using multiple regression.

D. The regression equation predicts new observations given predictor values. The equation is: Sales = \$25,813,509 – (\$98,596 × Price). The coefficient estimates are interpreted as follows:

• Sales will equal \$25.8M per quarter if the price equals zero. Historical quarterly sales never exceeded \$1.55M, and a price of zero lies far outside the observed range of \$247 to \$250, so something is suspicious with this model.
• For each \$1 increase in price, the company can expect sales to decrease by \$98,596. Conversely, for every \$1 decrease in price, it can expect sales to increase by \$98,596. While this sales swing seems great, the practitioner knows from the p-value and R² that price is not a good predictor.
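To make the interpretation concrete, here is a minimal sketch that evaluates the reported equation at a few prices. The function name `predict_sales` is hypothetical; the coefficients are the rounded values from Figure 1.

```python
# Rounded coefficients reported in Figure 1.
INTERCEPT = 25_813_509
PRICE_COEF = -98_596


def predict_sales(price):
    """Predicted quarterly sales (dollars) for a given average unit price."""
    return INTERCEPT + PRICE_COEF * price


for p in (247, 248, 250):
    print(f"Price ${p}: predicted sales ${predict_sales(p):,}")

# A $1 price cut raises the prediction by exactly the slope.
print(predict_sales(249) - predict_sales(250))

# Price = 0 is far outside the observed $247 to $250 range, so the
# intercept is an extrapolation rather than a realistic forecast.
print(f"Price $0: predicted sales ${predict_sales(0):,}")
```

Within the observed price range the predictions land near historical sales levels; the implausible figure appears only when the equation is pushed to a price of zero.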

### Multiple Linear Regression

The practitioner meets with the sales and marketing team again and they brainstorm other predictor variables impacting sales. The team suggests advertising spending, number of sales employees, frequency of trade shows in which the company participates and even number of competitors in the field. Table 2 contains the data.

Table 2: Variables Potentially Impacting Sales

| Quarter | Sales | Advertising | Price | Sales Force | Trade Shows | Competitors |
| --- | --- | --- | --- | --- | --- | --- |
| Q107 | \$1,000,000 | \$17,000 | \$250 | 4 | | 3 |
| Q207 | \$1,500,000 | \$75,000 | \$248 | 4 | 4 | 2 |
| Q307 | \$1,050,000 | \$20,000 | \$250 | 4 | | 3 |
| Q407 | \$1,200,000 | \$45,000 | \$250 | 4 | 1 | 3 |
| Q108 | \$975,000 | \$15,000 | \$248 | 3 | | 3 |
| Q208 | \$1,150,000 | \$25,000 | \$250 | 4 | 1 | 3 |
| Q308 | \$1,225,000 | \$40,000 | \$250 | 4 | 2 | 3 |
| Q408 | \$1,550,000 | \$75,000 | \$247 | 4 | 4 | 3 |
| Q109 | \$1,125,000 | \$25,000 | \$250 | 4 | 1 | 3 |
| Q209 | \$1,250,000 | \$45,000 | \$250 | 4 | 2 | 3 |
| Q309 | \$1,475,000 | \$75,000 | \$248 | 4 | 4 | 2 |
| Q409 | \$1,360,000 | \$70,000 | \$250 | 4 | 3 | 2 |

Because there is more than one predictor variable, the practitioner performs a multiple regression (Figure 2).

Figure 2: Regression Analysis: Multiple Predictors

A. Analysis of Variance (ANOVA)

| Source | DF | SS | MS | F | P |
| --- | --- | --- | --- | --- | --- |
| Regression | 5 | 4.18076E+11 | 83615164274 | 209.84 | 0.000 |
| Residual Error | 6 | 2390845295 | 398474216 | | |
| Total | 11 | 4.20467E+11 | | | |

B. Regression Analysis

| Predictor | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| Constant | 5814796 | 2489130 | 2.34 | 0.058 |
| Advertising | 3.483 | 1.062 | 3.28 | 0.017 |
| Price | -21354 | 10267 | -2.08 | 0.083 |
| Sales Force | 90746 | 35354 | 2.57 | 0.043 |
| Trade Shows | 63868 | 17030 | 3.75 | 0.010 |
| Competitors | 43873 | 21198 | 2.07 | 0.084 |

S = 19961.8

C. R-Sq = 99.4%, R-Sq(adj) = 99.0%

D. Sales = 5814796 + 3.48 Advertising – 21354 Price + 90746 Sales Force + 63868 Trade Shows + 43873 Competitors

Interpretation

A. Assume α = 0.05, meaning there is a 5 percent risk of rejecting the null when it is true. The low p-value in the ANOVA table above indicates that the model estimated by the regression analysis is significant at a 0.05 alpha (risk) level. In other words, at least one coefficient in the model is different from zero.

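B. The p-values for advertising (0.017), sales force (0.043) and trade shows (0.010) are less than 0.05 (alpha risk), indicating these predictors are significantly related to sales; advertising and trade shows have the lowest p-values and are the most significant. Price (0.083) and competitors (0.084) are not significant at this level. A predictor that is not significant is not necessarily unimportant; it means that, given the other variables already in the model, it adds little further explanatory power.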
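The significance screen can be checked directly from the reported output, since each t-statistic is the coefficient divided by its standard error. The sketch below is an illustration only: the dictionary name `reported` is hypothetical, and the two-sided critical value of 2.447 (6 residual degrees of freedom, α = 0.05) is taken from a standard t table, not from the article.

```python
# (coefficient, standard error) pairs copied from Figure 2.
reported = {
    "Advertising": (3.483, 1.062),
    "Price": (-21354, 10267),
    "Sales Force": (90746, 35354),
    "Trade Shows": (63868, 17030),
    "Competitors": (43873, 21198),
}

# Assumption: two-sided 5 percent critical value of t with 6 degrees of
# freedom (12 observations minus 6 estimated coefficients).
T_CRITICAL = 2.447

t_stats = {name: coef / se for name, (coef, se) in reported.items()}
significant = [name for name, t in t_stats.items() if abs(t) > T_CRITICAL]

for name, t in t_stats.items():
    flag = "significant" if name in significant else "not significant"
    print(f"{name:12s} T = {t:6.2f}  ({flag} at 0.05)")
```

Ranking the predictors by absolute t-statistic puts trade shows first and advertising second, with sales force also clearing the 0.05 threshold.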

C. The R² value indicates that all the predictor variables together explain 99.4 percent of the variance in sales. The adjusted R² is 99.0 percent, which accounts for the number of independent variables in the model. While both values indicate the model fits the data well, the adjusted R² is more appropriate here because the model contains several predictors.
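For readers who want to see how the adjustment works, the sketch below recomputes both figures from the sums of squares in the ANOVA table, using the standard formula adjusted R² = 1 – (1 – R²)(n – 1)/(n – p – 1), where n is the number of observations and p the number of predictors. The function names are hypothetical.

```python
def r_squared(ss_error, ss_total):
    """Share of the total variation explained by the model."""
    return 1 - ss_error / ss_total


def adjusted_r_squared(r2, n, p):
    """R-squared penalized for the number of predictors p, given n rows."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Sums of squares from the ANOVA table of the five-predictor model.
SS_ERROR = 2_390_845_295
SS_TOTAL = 4.20467e11
N_QUARTERS = 12
N_PREDICTORS = 5

r2 = r_squared(SS_ERROR, SS_TOTAL)
r2_adj = adjusted_r_squared(r2, N_QUARTERS, N_PREDICTORS)
print(f"R-Sq = {r2:.1%}, R-Sq(adj) = {r2_adj:.1%}")
```

Because the penalty grows with p, adding a weak predictor can raise R² while lowering adjusted R², which is why the adjusted figure is the better guide when comparing models of different sizes.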

D. The regression equation predicts new observations given predictor values. The coefficient estimates are interpreted as follows (holding all other predictors constant):

• Sales will equal \$5.8M per quarter if each of the independent variables equals zero. Again, common sense tells us something is suspicious with this model.
• For each \$1 increase in advertising, the company can expect sales to increase by \$3.48.
• For each \$1 increase in price, the company can expect sales to decrease by \$21,354.
• For each additional sales person, the company can expect sales to increase by \$90,746.
• For each additional trade show, the company can expect sales to increase by \$63,868.
• For each additional competitor, the company can expect sales to increase by \$43,873. (Perhaps more competitors create downward pressure on prices, which in turn opens the market to more customers and thus greater revenues.)

The practitioner could use all the predictor variables in the equation or perform a third regression analysis using the two most significant predictor variables: advertising and trade shows. In general, it is advisable to limit regression equations to a few key predictor variables, particularly if they are still producing a high adjusted R². Below are the results using advertising and trade shows.

The regression equation is Sales = 971893 + 2.60 Advertising + 83144 Trade Shows

| Predictor | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| Constant | 971893 | 30577 | 31.79 | 0.000 |
| Advertising | 2.596 | 1.599 | 1.62 | 0.139 |
| Trade Shows | 83144 | 24403 | 3.41 | 0.008 |

S = 32118.2, R-Sq = 97.8%, R-Sq(adj) = 97.3%

Analysis of Variance

| Source | DF | SS | MS | F | P |
| --- | --- | --- | --- | --- | --- |
| Regression | 2 | 4.11182E+11 | 2.05591E+11 | 199.30 | 0.000 |
| Residual Error | 9 | 9284227959 | 1031580884 | | |
| Total | 11 | 4.20467E+11 | | | |
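The two-predictor fit reported above can be approximated from the Table 2 data with standard-library Python. This is a sketch under one loud assumption: quarters with no trade show entry in Table 2 are treated as zero trade shows. The function name `fit_two_predictors` is hypothetical, and the approach (centering the data and solving the 2 × 2 normal equations) is one simple way to obtain a least-squares fit, not necessarily how the original output was produced.

```python
# Table 2 data in dollars (advertising, sales) and counts (trade shows).
advertising = [17_000, 75_000, 20_000, 45_000, 15_000, 25_000,
               40_000, 75_000, 25_000, 45_000, 75_000, 70_000]
# ASSUMPTION: blank Trade Shows cells in Table 2 are entered as 0.
trade_shows = [0, 4, 0, 1, 0, 1, 2, 4, 1, 2, 4, 3]
sales = [1_000_000, 1_500_000, 1_050_000, 1_200_000, 975_000, 1_150_000,
         1_225_000, 1_550_000, 1_125_000, 1_250_000, 1_475_000, 1_360_000]


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2, R^2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Center every variable so the intercept drops out of the equations.
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    # Solve the 2 x 2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    ss_total = sum(c * c for c in dy)
    ss_error = sum((yi - (b0 + b1 * a + b2 * b)) ** 2
                   for a, b, yi in zip(x1, x2, y))
    return b0, b1, b2, 1 - ss_error / ss_total


b0, b1, b2, r2 = fit_two_predictors(advertising, trade_shows, sales)
print(f"Sales = {b0:.0f} + {b1:.3f} Advertising + {b2:.0f} Trade Shows")
print(f"R-Sq = {r2:.1%}")
```

Under the zero-fill assumption the coefficients come out very close to the reported values, which suggests the blank cells do stand for quarters without trade shows.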

Notice the adjusted R² value is 97.3 percent, which indicates this model still fits the data well.

Now, the practitioner tests the forecast model with some key assumptions:

Advertising budget = \$75,000 per quarter

Trade shows = 4 per quarter

Sales = \$971,893 + (2.6 × \$75,000) + (4 × \$83,144) = \$1,499,469

Quarterly sales are projected to be nearly \$1.5 million with these two predictor values. This is consistent with the historical data: the three quarters with \$75,000 of advertising and four trade shows (Q207, Q408 and Q309) had sales between \$1,475,000 and \$1,550,000. The company now has a valuable forward-looking model to perform sales forecasts. For example, the company could adjust levels of advertising and/or participate in more trade shows throughout the year to achieve certain sales goals.
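A forecast like this is easy to wrap in a small function so different scenarios can be compared. The sketch below is illustrative: the function name `forecast_sales` is hypothetical, and it uses the rounded coefficients from the worked calculation (2.6 for advertising and 83,144 for trade shows).

```python
def forecast_sales(advertising, trade_shows,
                   intercept=971_893, adv_coef=2.6, show_coef=83_144):
    """Forecast quarterly sales (dollars) from the two-predictor equation."""
    return intercept + adv_coef * advertising + show_coef * trade_shows


base = forecast_sales(advertising=75_000, trade_shows=4)
print(f"Base scenario: ${base:,.0f}")

# What-if: same advertising budget, one fewer trade show.
fewer_shows = forecast_sales(advertising=75_000, trade_shows=3)
print(f"Three trade shows: ${fewer_shows:,.0f}")
print(f"Difference: ${base - fewer_shows:,.0f}")
```

Scenarios should stay within the range of the historical data (roughly \$15,000 to \$75,000 of advertising and up to four trade shows per quarter), since the equation has not been tested outside it.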

### Time to Prosper

In crowded and competitive markets, one of the last remaining areas of differentiation lies in the ability of organizations to use statistics and analytics. Using the strategic profit roadmap and applying analytics through statistical tools can enable Six Sigma practitioners to flourish in this new business environment. A note of caution: Statistics is not an exact science. Strong data correlation does not necessarily prove a cause-and-effect relationship. Regression models like those above need to be tempered with the wisdom of experienced managers and Six Sigma professionals.
