Taking Advantage of the Age of Statistics: Part 2

Part 1 of this article helped practitioners understand the key drivers for the growth of statistics, introduced some leading analytics competitors and outlined the high-level profit roadmap. Now it is time to begin using the tools. Part 2 describes some statistical tools practitioners can use for predictive decision making.

Applying the Analytics Tools

Two key statistical tools commonly used to provide valuable business insights are:

1. Analysis of variance (ANOVA), which tests whether or not groups of data have the same or differing means.

2. Regression analysis, which measures the strength of the statistical relationship between a response variable and one or more predictor variables. There are two basic types of regression analysis: simple linear (for one predictor variable) and multiple linear (for more than one predictor variable).

For simple or multiple linear regression, the response variable must be a continuous variable. Predictor variables can be continuous or discrete, but must be independent of one another. The beauty of regression analysis is that it produces a mathematical equation that can be used to build forward-looking models.

Simple Linear Regression

Consider a generic business in either the manufacturing or service sector. To improve profitability, the business decides to focus on increasing sales – the response variable. The company’s sales and marketing team shares with the practitioner that price is generally a good predictor variable impacting sales. The practitioner collects quarterly sales data for the past three years with the average retail unit price (Table 1).

Table 1: Sales and Price Data

Quarter	Sales	Price
Q107	$1,000,000	$250
Q207	$1,500,000	$248
Q307	$1,050,000	$250
Q407	$1,200,000	$250
Q108	$975,000	$248
Q208	$1,150,000	$250
Q308	$1,225,000	$250
Q408	$1,550,000	$247
Q109	$1,125,000	$250
Q209	$1,250,000	$250
Q309	$1,475,000	$248
Q409	$1,360,000	$250

With this data, the practitioner can perform a simple linear regression analysis (Figure 1).

Figure 1: Regression Analysis: Sales versus Price

A. Analysis of Variance (ANOVA)

Source DF SS MS F P
Regression 1 1.38528E+11 1.38528E+11 4.91 0.051
Residual Error 10 2.81939E+11 28193859649
Total 11 4.20467E+11

B. Regression Analysis

Predictor Coef SE Coef T P
Constant 25813509 11086875 2.33 0.042
Price -98596 44481 -2.22 0.051

C. S = 167910 R-Sq = 32.9%

D. The regression equation is
Sales = 25813509 – 98596 Price
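
The coefficients in Figure 1 can be checked directly against the Table 1 data. The following is a minimal sketch in plain Python, assuming the standard closed-form least-squares formulas for a single predictor; the function name `fit_simple_linear` is illustrative and not part of the original analysis.

```python
# Table 1 data: average unit price and quarterly sales, both in dollars.
prices = [250, 248, 250, 250, 248, 250, 250, 247, 250, 250, 248, 250]
sales = [1_000_000, 1_500_000, 1_050_000, 1_200_000, 975_000, 1_150_000,
         1_225_000, 1_550_000, 1_125_000, 1_250_000, 1_475_000, 1_360_000]


def fit_simple_linear(x, y):
    """Ordinary least squares for one predictor.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of squared deviations and cross-products about the means.
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return intercept, slope, r_squared


intercept, slope, r_squared = fit_simple_linear(prices, sales)
print(f"Sales = {intercept:.0f} + ({slope:.0f}) * Price")
print(f"R-Sq = {r_squared:.1%}")
```

Run on the Table 1 data, this should give a constant, price coefficient and R-Sq that agree with Figure 1 to within rounding.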

Interpretation

A. The p-value in the ANOVA table is a probability that is compared with the decision criterion, the alpha (α) risk. Assume α = 0.05, meaning there is a 5 percent risk of rejecting the null hypothesis when it is true. In this example, the null hypothesis (H₀) is that price has no relationship with sales. The alternate hypothesis (H_a) is that price does have a relationship with sales. A p-value below α indicates the factor is significant: if the null hypothesis were true, a result this extreme would be expected less than 5 percent of the time, so the practitioner rejects the null and concludes the variable is significant. Conversely, a p-value above α indicates the factor is not significant. The p-value in this analysis (0.051) is slightly above the 5 percent alpha risk, indicating that the model estimated by the regression analysis is not significant at a 0.05 alpha (risk) level. The ANOVA table does not, however, provide the regression equation. Regression analysis is needed for that.
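
The F value in the ANOVA table is the regression mean square divided by the residual mean square. A minimal sketch of that arithmetic, and of the comparison against α, using the numbers printed in Figure 1 (the variable names are illustrative, and the p-value is copied from the figure rather than computed):

```python
# Sums of squares and degrees of freedom copied from the ANOVA table in Figure 1.
ss_regression, df_regression = 1.38528e11, 1
ss_residual, df_residual = 2.81939e11, 10

ms_regression = ss_regression / df_regression  # mean square for the model
ms_residual = ss_residual / df_residual        # mean square error
f_statistic = ms_regression / ms_residual

alpha = 0.05
p_value = 0.051  # read from Figure 1, not computed here
is_significant = p_value < alpha

print(f"F = {f_statistic:.2f}")
print(f"Significant at alpha = {alpha}: {is_significant}")
```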

B. The regression results indicate the size, direction, and statistical significance of the relationship between a predictor and response.

Coefficients represent the mean change in the response for one unit of change in the predictor, while holding other predictors in the model constant.
Sign (plus or minus) of each coefficient indicates the direction of the relationship (positive or negative correlation). For example, price has a negative sign, indicating it is negatively correlated with sales. This makes sense since one would expect higher sales given a lower price and vice versa.
The standard error of the coefficient (SE Coef) estimates the amount of sampling error in the coefficient.
The t-statistic (coefficient/SE coefficient) is used to test the significance of the coefficient.
P-values test the null hypothesis that the coefficient is equal to zero (no effect). Low p-values suggest the predictor is a meaningful addition to the model. High p-values suggest the predictor is not a meaningful addition to the model. Here the p-value is 0.051 (greater than the 0.05 alpha risk), indicating price is not significantly related to sales. A predictor that is not significant is not necessarily useless; it means only that, in this model and with this data, its effect cannot be distinguished from zero. This is a prompt for the practitioner to explore other predictor variables using multiple regression.
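
The t-statistic for price can be recomputed from the two columns beside it in Figure 1. A short sketch with illustrative variable names; it also relies on the general property that, with a single predictor, the square of the t-statistic equals the F value in the ANOVA table:

```python
coef_price = -98_596     # Coef for Price in Figure 1
se_coef_price = 44_481   # SE Coef for Price in Figure 1

# t-statistic = coefficient / standard error of the coefficient
t_statistic = coef_price / se_coef_price
print(f"t = {t_statistic:.2f}")

# With one predictor, t squared should match the ANOVA F value.
print(f"t squared = {t_statistic ** 2:.2f}")
```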

C. The coefficient of determination (R²) value indicates that price (predictor variable) explains only 32.9 percent of the variance in sales. This suggests the model does not fit the data well. The practitioner should identify more significant predictor variables using multiple regression.

D. The regression equation predicts new observations given predictor values. The equation is: sales = $25,813,509 – $98,596 price. The coefficient estimate is interpreted as follows:

Sales will equal $25.8M per quarter if the independent variable (price) equals zero. Historical sales have never exceeded $1.55M per quarter and prices have ranged only from $247 to $250, so something is suspicious with this model; the intercept is an extrapolation far outside the observed data.
For each $1 increase in price, the company can expect sales to decrease by $98,596. Conversely, for every $1 decrease in price, it can expect sales to increase by $98,596. While this sales swing seems great, the practitioner knows from the p-value and R² that price is not a good predictor.
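
To see what the equation in D implies in practice, it can be evaluated at a few prices. This is a minimal sketch using the rounded coefficients from Figure 1; `predict_sales` is an illustrative name.

```python
INTERCEPT = 25_813_509   # Constant from Figure 1
PRICE_COEF = -98_596     # Price coefficient from Figure 1


def predict_sales(price):
    """Predicted quarterly sales (dollars) for a given average unit price."""
    return INTERCEPT + PRICE_COEF * price


# Prices inside the range observed in Table 1.
for price in (247, 248, 250):
    print(f"Price ${price}: predicted sales ${predict_sales(price):,}")

# A price of zero is far outside the observed range; the result is the intercept.
print(f"Price $0: predicted sales ${predict_sales(0):,}")
```

At $248 the equation predicts roughly $1.36M, while the three quarters in Table 1 that were actually priced at $248 ranged from $975,000 to $1,500,000. That spread is consistent with the low R² noted in C.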

Multiple Linear Regression

The practitioner meets with the sales and marketing team again and they brainstorm other predictor variables impacting sales. The team suggests advertising spending, number of sales employees, frequency of trade shows in which the company participates and even number of competitors in the field. Table 2 contains the data.

Table 2: Variables Potentially Impacting Sales

Quarter	Sales	Advertising	Price	Sales Force	Trade Shows	Competitors
Q107	$1,000,000	$17,000	$250	4	0	3
Q207	$1,500,000	$75,000	$248	4	4	2
Q307	$1,050,000	$20,000	$250	4	0	3
Q407	$1,200,000	$45,000	$250	4	1	3
Q108	$975,000	$15,000	$248	3	0	3
Q208	$1,150,000	$25,000	$250	4	1	3
Q308	$1,225,000	$40,000	$250	4	2	3
Q408	$1,550,000	$75,000	$247	4	4	3
Q109	$1,125,000	$25,000	$250	4	1	3
Q209	$1,250,000	$45,000	$250	4	2	3
Q309	$1,475,000	$75,000	$248	4	4	2
Q409	$1,360,000	$70,000	$250	4	3	2

Because there is more than one predictor variable, the practitioner performs a multiple regression (Figure 2).

Figure 2: Regression Analysis: Multiple Predictors

A. Analysis of Variance (ANOVA)

Source DF SS MS F P
Regression 5 4.18076E+11 83615164274 209.84 0.000
Residual Error 6 2390845295 398474216
Total 11 4.20467E+11

B. Regression Analysis

Predictor Coef SE Coef T P
Constant 5814796 2489130 2.34 0.058
Advertising 3.483 1.062 3.28 0.017
Price -21354 10267 -2.08 0.083
Sales Force 90746 35354 2.57 0.043
Trade Shows 63868 17030 3.75 0.010
Competitors 43873 21198 2.07 0.084

C. S = 19961.8 R-Sq = 99.4% R-Sq(adj) = 99.0%

D. The regression equation is
Sales = 5814796 + 3.48 Advertising – 21354 Price + 90746 Sales Force + 63868 Trade Shows + 43873 Competitors

Interpretation

A. Assume α = 0.05, meaning there is a 5 percent risk of rejecting the null when it is true. The low p-value in the ANOVA table above indicates that the model estimated by the regression analysis is significant at a 0.05 alpha (risk) level. In other words, at least one coefficient in the model is different from zero.

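B. The p-values for advertising (0.017), sales force (0.043) and trade shows (0.010) are all less than 0.05 (alpha risk), indicating these predictors are significantly related to sales. Advertising and trade shows have the lowest p-values and are therefore the most significant. Price (0.083) and competitors (0.084) are not significant at this level. A predictor that is not significant is not necessarily unimportant; it means only that, given the other variables in the model, it adds little additional explanatory power.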

C. The R² value indicates that all the predictor variables explain 99.4 percent of the variance in sales. The adjusted R² is 99.0 percent, which accounts for the number of independent variables in the model. While both values indicate the model fits the data well, the adjusted R² is more appropriate here.
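
Both values in C can be recovered from the ANOVA sums of squares in Figure 2. The sketch below assumes the usual adjustment formula, 1 − (1 − R²)(n − 1)/(n − k − 1), with n = 12 quarters and k = 5 predictors; the function name `adjusted_r_squared` is illustrative.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjust R-squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (
        n_observations - n_predictors - 1
    )


# Residual and total sums of squares copied from Figure 2.
ss_residual = 2_390_845_295
ss_total = 4.20467e11

r_squared = 1 - ss_residual / ss_total
r_squared_adj = adjusted_r_squared(r_squared, n_observations=12, n_predictors=5)

print(f"R-Sq = {r_squared:.1%}")
print(f"R-Sq(adj) = {r_squared_adj:.1%}")
```

The adjustment grows as predictors are added relative to the number of observations, which is why the adjusted value is the better guide when five predictors are fitted to only 12 quarters of data.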

D. The regression equation predicts new observations given predictor values. The coefficient estimates are interpreted as follows (holding all other predictors constant):

Sales will equal $5.8M per quarter if each of the independent variables equals zero. Again, common sense tells us something is suspicious with this model.
For each $1 increase in advertising, the company can expect sales to increase by $3.48.
For each $1 increase in price, the company can expect sales to decrease by $21,354.
For each additional sales person, the company can expect sales to increase by $90,746.
For each additional trade show, the company can expect sales to increase by $63,868.
For each additional competitor, the company can expect sales to increase by $43,873. (Perhaps more competitors create downward pressure on prices, which in turn opens the market to more customers and thus greater revenues.)

The practitioner could use all the predictor variables in the equation or perform a third regression analysis using the two most significant predictor variables: advertising and trade shows. In general, it is advisable to limit regression equations to a few key predictor variables, particularly if they are still producing a high adjusted R². Below are the results using advertising and trade shows.

Figure 3: Regression Analysis: Sales versus Advertising, Trade Shows

The regression equation is
Sales = 971893 + 2.60 Advertising + 83144 Trade Shows

Predictor Coef SE Coef T P
Constant 971893 30577 31.79 0.000
Advertising 2.596 1.599 1.62 0.139
Trade Shows 83144 24403 3.41 0.008

S = 32118.2 R-Sq = 97.8% R-Sq(adj) = 97.3%

Analysis of Variance

Source DF SS MS F P
Regression 2 4.11182E+11 2.05591E+11 199.30 0.000
Residual Error 9 9284227959 1031580884
Total 11 4.20467E+11
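
The two-predictor fit in Figure 3 can be checked against the Table 2 data. The following is a minimal sketch in plain Python that solves the least-squares normal equations for exactly two predictors using sums of deviations about the means; the helper names are illustrative, and a statistics package would normally be used instead.

```python
# Table 2 columns used in Figure 3 (all amounts in dollars).
advertising = [17_000, 75_000, 20_000, 45_000, 15_000, 25_000,
               40_000, 75_000, 25_000, 45_000, 75_000, 70_000]
trade_shows = [0, 4, 0, 1, 0, 1, 2, 4, 1, 2, 4, 3]
sales = [1_000_000, 1_500_000, 1_050_000, 1_200_000, 975_000, 1_150_000,
         1_225_000, 1_550_000, 1_125_000, 1_250_000, 1_475_000, 1_360_000]


def mean(values):
    return sum(values) / len(values)


def centered_sum(a, b):
    """Sum of products of deviations of a and b from their means."""
    a_mean, b_mean = mean(a), mean(b)
    return sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    s11, s22 = centered_sum(x1, x1), centered_sum(x2, x2)
    s12 = centered_sum(x1, x2)
    s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)
    det = s11 * s22 - s12 ** 2  # zero only if x1 and x2 are perfectly collinear
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2


b0, b_adv, b_shows = fit_two_predictors(advertising, trade_shows, sales)
print(f"Sales = {b0:.0f} + {b_adv:.2f} Advertising + {b_shows:.0f} Trade Shows")
```

The printed equation should agree with the one in Figure 3 to within rounding.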

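Notice the adjusted R² value is 97.3 percent, which indicates this model still fits the data well. Note, however, that the p-value for advertising rises to 0.139 in this reduced model, most likely because advertising spending and trade show participation move closely together in Table 2.

Now, the practitioner tests the forecast model with some key assumptions: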

Sales = $971,893 + $2.6 advertising + $83,144 trade shows

Advertising budget = $75,000 per quarter

Trade shows = 4 per quarter

Sales = $971,893 + (2.6 x $75,000) + (4 x $83,144) = $1,499,469.

Quarterly sales are projected to be nearly $1.5 million using these two key predictor variable amounts. This is consistent with the historical data. The company now has a valuable forward-looking model to perform sales forecasts. For example, the company could adjust levels of advertising and/or participate in more trade shows throughout the year to achieve certain sales goals.
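
The forecast above can be wrapped in a small function so that different advertising and trade show plans are easy to compare. This is a sketch using the rounded coefficients from the forecast equation; the function name and the second scenario are hypothetical.

```python
def forecast_sales(advertising, trade_shows):
    """Quarterly sales forecast (dollars) from the rounded two-predictor equation."""
    return 971_893 + 2.6 * advertising + 83_144 * trade_shows


# Scenario from the article: $75,000 of advertising and 4 trade shows per quarter.
baseline = forecast_sales(advertising=75_000, trade_shows=4)
print(f"Baseline forecast: ${baseline:,.0f}")

# Hypothetical what-if scenario, kept inside the ranges observed in Table 2.
what_if = forecast_sales(advertising=50_000, trade_shows=3)
print(f"What-if forecast: ${what_if:,.0f}")
```

Forecasts are most trustworthy when the inputs stay within the historical ranges, here $15,000 to $75,000 of advertising and zero to four trade shows per quarter.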

Time to Prosper

In crowded and competitive markets, one of the last remaining areas of differentiation lies in the ability of organizations to use statistics and analytics. Using the strategic profit roadmap and applying analytics through statistical tools can enable Six Sigma practitioners to flourish in this new business environment. A note of caution: Statistics is not an exact science. Strong data correlation does not necessarily prove a cause-and-effect relationship. Regression models like those above need to be tempered with the wisdom of experienced managers and Six Sigma professionals.

About the Author

Peter J. Sherman