# Stepwise Regression

Six Sigma – iSixSigma Forums Old Forums General Stepwise Regression

Viewing 23 posts - 1 through 23 (of 23 total)
• #30898

Fonseca
Participant

Although we can perform stepwise regression in Minitab, I have heard many opinions against this procedure. And when we read the articles against it, we become a little confused about which procedure is correct for finding the truly significant independent variables (when we have one dependent variable and several independent ones…).
I would appreciate any comment.
Thanks,
Marcelo

#81100

Robert Butler
Participant

I can’t speak to opinions to the contrary, but I do know that, in the statistical literature, stepwise regression is viewed as an excellent method for variable selection. From p. 310 of Draper and Smith, Applied Regression Analysis, 2nd Edition, we have the following:
“We believe this (stepwise selection with replacement) to be one of the best of the variable selection procedures discussed and recommend its use … stepwise regression can easily be abused by the ‘amateur’ statistician. As with all the procedures discussed, sensible judgement is still required in the initial selection of variables and in the critical examination of the model through examination of the residuals. It is easy to rely too heavily on the automatic selection performed in the computer.”
On those occasions when I have been asked to review a regression effort that has been the cause of a lot of disagreement, what I usually find is exactly what is mentioned above: someone made the mistake of confusing the misuse of a computer package with statistical analysis.
“Sensible judgment” includes many things:
1. Understanding the data origin. Design data (without failures) differs from design data with failures which differs from production data, which in turn differs from data gathered from a variety of sources.
2. Checking the X’s for such things as confounding, qualitative vs. quantitative, same name different units (plant A measures in psi, plant B measures the same thing in mm of Hg) etc.
3. Never relying on one form of stepwise regression.  I always run forward selection with replacement and backward elimination on every response.  Design data and even design data with failures will usually give the same model regardless of the method chosen (assuming your entry and exit conditions are identical) but there can be surprises.  Any other kind of data will more often than not give a different model for each type of stepwise regression.
4. Plot the data, and look at the plots. After analysis, plot the residuals and look at the residual plots.
5. Never forget that what you have is a correlation. You, not the equation, must provide the physical justification for the presence of terms in the model. If the terms don’t make physical sense, investigate: maybe something is there and maybe it is just dumb luck.
There are other things to consider, but I have found that if you pay attention to the above you will find stepwise regression to be a very powerful analytical tool. On the other hand, if you attempt to use stepwise regression as some form of automated analysis you will probably go wrong with great assurance.
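To make point 3 concrete, here is a minimal sketch of running two selection methods on the same response and comparing the resulting models. It is not Minitab's procedure: it uses plain forward selection (without the replacement step) and backward elimination, a fixed partial-F threshold of 4.0 chosen only for illustration, and synthetic data in which `x3` is unrelated to `y`. All names and numbers are hypothetical.

```python
import random

def solve(a, b):
    """Solve the square linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def rss(columns, y):
    """Residual sum of squares for least squares of y on an intercept plus the columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def partial_f(X, y, reduced, full):
    """Partial F statistic for the one term that `full` contains and `reduced` lacks."""
    rss_full = rss([X[k] for k in full], y)
    rss_reduced = rss([X[k] for k in reduced], y)
    df_resid = len(y) - (len(full) + 1)
    return (rss_reduced - rss_full) / (rss_full / df_resid)

def forward_selection(X, y, f_enter=4.0):
    """Add the term with the largest partial F until none reaches f_enter."""
    chosen = []
    while len(chosen) < len(X):
        candidates = [(partial_f(X, y, chosen, chosen + [k]), k)
                      for k in X if k not in chosen]
        best_f, best = max(candidates)
        if best_f < f_enter:
            break
        chosen.append(best)
    return chosen

def backward_elimination(X, y, f_remove=4.0):
    """Start with every term and drop the weakest while its partial F is below f_remove."""
    chosen = list(X)
    while chosen:
        candidates = [(partial_f(X, y, [j for j in chosen if j != k], chosen), k)
                      for k in chosen]
        worst_f, worst = min(candidates)
        if worst_f >= f_remove:
            break
        chosen.remove(worst)
    return chosen

# Synthetic, made-up data: y depends on x1 and x2 only.
random.seed(0)
n = 40
X = {name: [random.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2 + 3 * a - 2 * b + random.gauss(0, 0.5) for a, b in zip(X["x1"], X["x2"])]

fwd = forward_selection(X, y)
bwd = backward_elimination(X, y)
print("forward :", sorted(fwd))
print("backward:", sorted(bwd))
```

With clean, design-like data such as this, both methods should agree on the strong terms. When the two lists differ, that is the cue to go back to the plots and the physics rather than to trust either list.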

#81103

Fonseca
Participant

Thanks for the answer Mr. Butler.
However, I wonder whether I am making a mistake about the regression analysis assumptions. Suppose I have a total defect rate that I can monitor from day to day, and I can also monitor all the individual defect rates that make up the total. It is just a mathematical relation: (total defect rate) = (defect1 rate + defect2 rate + … + defectn rate). I can’t use regression analysis in this case because it is a deterministic (mathematical) relation and not a statistical model. OK. But if I need to find the defects that “drive” the failure process, I would be choosing some of the defects. How should I use regression analysis to help me in this case? What I need is to replace the total defect rate with the two or three defects that drive the process and then fit a linear model to them.
What would be the best way ?

Thanks,
Marcelo

#81159

Robert Butler
Participant

If you are trying to find the defects that “drive” the process of failure, then you will have to start gathering data on both failures and defects. For example, when a failure is observed you will have to record the failure type (qualitative or quantitative) and note the defects that are observed when that failure (or those failures) occur. With this kind of data in hand you will then have to examine the matrix of defects to determine which ones are independent of one another and thus which ones can be used in a regression analysis.
A matrix of independent X’s (types of defects) and the associated Y’s (types of failures) can then be examined using regression techniques to develop a correlation between the two. (I’m assuming that your Y’s are some kind of continuous measurements and your X’s are either discrete or continuous.)

#81170

Fonseca
Participant

Robert,
I have already gathered all the data for failures and defects (I update it every day through an IT report). The idea of checking the dependence among the X’s is very useful for creating a regression model. The problem that remains is that my failure is just the sum of the defect types. I need to know which 2 or 3 defects can substitute for the sum of all defects.
Marcelo

#81204

Robert Butler
Participant

This is one of those cases where the fog of the written word may cause problems.  As I re-read your posts I’m left with the following picture of your data:  You have n items that you have checked for different types of failure and different types of defects and you are looking for those defects that can best be used to characterize the failures.  If this is the case then rather than using regression it would probably be more appropriate to arrange the data in a contingency table as illustrated below.

| Failure type \ Defect type | D1 | D2 | D3 | D4 | D5 | … |
|---|---|---|---|---|---|---|
| F1 | 7 | 12 | 1 | 0 | 10 | |
| F2 | 2 | 5 | 15 | 1 | 9 | |
| F3 | …etc. | | | | | |

You would want to include “no failure” in the list of failure types, and you would use the chi-squared test for your analysis. Chapter 5 of Statistical Theory and Methodology in Science and Engineering by Brownlee (2nd Edition) has the details. If you don’t have ready access to that book, check your available statistics texts and look under the subject headings of multinomial distributions and contingency tables.
The chi-squared test will tell you whether or not the different failure types differ in their distributions of defect types; however, it will not tell you in what specific ways the failure types differ. To understand that, you will have to look at the table, note the discrepancies between expectation and observation, and draw conclusions based on your understanding of the physical process.
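As a rough sketch of that calculation, the code below runs the chi-squared test of independence by hand on the two complete rows of the illustrative table (F1 and F2; the remaining rows are elided in the post and are not guessed here). The 5% critical value for 4 degrees of freedom, 9.488, is a standard table value. Note that column D4 has expected counts well below the usual rule-of-thumb minimum of 5, so in practice sparse columns like that one would probably need to be pooled.

```python
# Observed counts: the two complete rows of the illustrative table.
observed = {
    "F1": [7, 12, 1, 0, 10],
    "F2": [2, 5, 15, 1, 9],
}
n_cols = 5

row_totals = {name: sum(row) for name, row in observed.items()}
col_totals = [sum(row[j] for row in observed.values()) for j in range(n_cols)]
grand_total = sum(col_totals)

# Expected count under independence: row total * column total / grand total.
expected = {
    name: [row_totals[name] * c / grand_total for c in col_totals]
    for name in observed
}

# Per-cell contributions show where observation and expectation disagree most.
contributions = {
    (name, j): (observed[name][j] - expected[name][j]) ** 2 / expected[name][j]
    for name in observed
    for j in range(n_cols)
}
chi2 = sum(contributions.values())
dof = (len(observed) - 1) * (n_cols - 1)
CRITICAL_5PCT = 9.488  # upper 5% point of chi-squared with 4 degrees of freedom

print(f"chi-squared = {chi2:.2f} on {dof} degrees of freedom")
print("differs at the 5% level:", chi2 > CRITICAL_5PCT)
worst = max(contributions, key=contributions.get)
print("largest contribution: row", worst[0], "column D%d" % (worst[1] + 1))
```

For these two rows the statistic comes out at roughly 18.9, well above the critical value, and the largest contributions sit in column D3. That is the "look at the table" step: the test says the rows differ, and the per-cell contributions point to where.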
If  I’ve made a mistake in terms of understanding your data structure let me know and I’ll try again.

#81209

Fonseca
Participant

Robert,
I think you are completely right about the “fog”… I haven’t explained the situation exactly yet. I will try again. Suppose we have the data below. Note that the “failure” Total is just the sum of the defective proportions (in %) that I record every day. For email purposes I decided to show only five samples, OK? Suppose also that I have many more than 5 types of defects. What I need to do is choose 2 or 3 types of defects (among the 5 types shown) that can substitute for the “Total” function. In other words, after a regression analysis I would like to express the “Total” by a linear model involving the defects that drive the process, and discard the defects that don’t have great significance. Something like this: Total = beta0 + beta1*d1 + beta2*d4. Note that if I enter all the defects in my model I will get a singular matrix and R-Sq = 100%. That is because “Total” is just an exact mathematical expression of all the independent variables (defects).
I think this idea is similar to what we do through stepwise regression. That is the reason I decided to use it. But I don’t know if I am making a big mistake… There are so many assumptions to be followed.
Thanks,
Marcelo

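| d1 | d2 | d3 | d4 | d5 | Total |
|---|---|---|---|---|---|
| 0.03 | 0.02 | 0.01 | 0.04 | 0.03 | 0.13 |
| 0.02 | 0.02 | 0.01 | 0.06 | 0.01 | 0.12 |
| 0.01 | 0.01 | 0.01 | 0.04 | 0.08 | 0.15 |
| 0.00 | 0.01 | 0.04 | 0.08 | 0.05 | 0.18 |
| 0.01 | 0.00 | 0.10 | 0.02 | 0.01 | 0.14 |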
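A quick check of the point about R-Sq = 100%, using only the five samples posted above: because Total is defined as the sum of the defect rates, a model with every defect entered (intercept 0, every coefficient 1) leaves zero residual in each row. The values are stored as integer hundredths of a percent so that the arithmetic is exact; the variable names are made up for this sketch.

```python
# Each row: d1, d2, d3, d4, d5, Total, in hundredths of a percent.
rows = [
    (3, 2, 1, 4, 3, 13),
    (2, 2, 1, 6, 1, 12),
    (1, 1, 1, 4, 8, 15),
    (0, 1, 4, 8, 5, 18),
    (1, 0, 10, 2, 1, 14),
]

# Residual of the "model" Total = 0 + 1*d1 + 1*d2 + 1*d3 + 1*d4 + 1*d5.
residuals = [total - sum(defects) for *defects, total in rows]

totals = [row[-1] for row in rows]
mean_total = sum(totals) / len(totals)
ss_total = sum((t - mean_total) ** 2 for t in totals)
ss_resid = sum(r ** 2 for r in residuals)
r_squared = 1 - ss_resid / ss_total

print("residuals:", residuals)
print("R-squared:", r_squared)
```

Since the residuals are identically zero, there is no random error for a regression to model, which is why the full fit says nothing about which defects "drive" the total.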

#81245

Robert Butler
Participant

The idea of a regression equation is to develop a correlation between one or more “causes” and an “effect” (response).  If the regression equation is to be of value it must be possible to vary the causes at will.  As described, you cannot do this.  The percentage of a given defect is whatever it is and all you can do is record it and sum it with other defect percentages to get a total defect percentage.  In order for a regression equation to be of value you would have to develop correlations between things that you can vary at will (process variables-for example) and certain kinds of defects.
I would recommend using the raw defect counts from your current data set to build a Pareto chart of the defect types. This will permit the identification of the most frequently occurring defects. The focus would then shift to an identification of independent factors that generate the “vital few” defect types. Once these factors are identified it would be possible to build a regression equation relating them to the frequency of occurrence of a particular kind of defect. These correlation equations could then be combined to identify system conditions that would minimize the most frequently occurring defects and hence permit a control (and a reasonable prediction) of the total defect count.
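A minimal sketch of the Pareto tabulation follows. The recommendation is to use raw defect counts; those are not in the thread, so this example falls back on the five posted percentage samples (as integer hundredths of a percent) purely as a stand-in. Five samples are far too few to draw conclusions from, and the ranking is only meant to show the mechanics.

```python
# Stand-in data: the five posted samples per defect type, in hundredths of a percent.
samples = {
    "d1": [3, 2, 1, 0, 1],
    "d2": [2, 2, 1, 1, 0],
    "d3": [1, 1, 1, 4, 10],
    "d4": [4, 6, 4, 8, 2],
    "d5": [3, 1, 8, 5, 1],
}

totals = {name: sum(values) for name, values in samples.items()}
grand_total = sum(totals.values())

# Rank defect types from most to least frequent and accumulate their share.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
pareto = []
running = 0
for name, value in ranked:
    running += value
    pareto.append((name, value, round(100 * running / grand_total, 1)))

for name, value, cumulative in pareto:
    print(f"{name}: {value:3d}  cumulative {cumulative:5.1f}%")
```

On these stand-in numbers d4, d5 and d3 together account for about 82% of the total, so they would be the candidates for the "vital few" whose process drivers are worth hunting for.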

#81262

Fonseca
Participant

Thank you, Robert. I understood exactly your explanation. I will try to find the real independent factors.
Marcelo

#81288

Fonseca
Participant

Robert,
I think another point must be taken into consideration about the situation I described to you. My defects vary over time, and it is possible that they show some seasonality or trend. I don’t know whether regression analysis fits this time series situation, because of the behaviour of the residuals. Could you send me any comments on how to handle the time variable in regression analysis?
Thanks,
Marcelo

#81338

Dr. Steve W.
Participant

You are dealing with data that varies with time. You should get a book on time series models. A regular regression approach won’t work. Get a book by the great George Box.

#81363

Robert Butler
Participant

If you think that the defects are seasonal, you should first confirm this by checking such things as changing frequency of defect occurrence as a function of time. You would want to plot these against time for a visual check and then, if it appears that such is the case, you could check it formally using time series analysis. As was recommended, in order to do this you will have to do some reading. The book Time Series Analysis by Box and Jenkins is an often-referenced text; unfortunately, it does require a lot of effort to work through. Given that you seem to have some doubt about seasonality, you might want to first check The Analysis of Time Series, Second Edition, by Chatfield. Chapter 2 is titled “Simple Descriptive Techniques” and it may be all that you need for the present.
Even if there is a seasonal effect, the issue is still that of finding the most frequently occurring group of defects and defining their drivers. From this standpoint, the effect of seasonality is not a major issue and, if it is in fact an effect, it may help in the identification of important defect drivers. For example, if it turns out that your most important group of defects always occurs in the summer, you can begin ruling out any kind of driver that would be limited to winter months.
If your situation does turn out to be as described in the above paragraph then there really won’t be any need to have to include time in your final regression equation since the model would only need to be invoked for process control during those months when the critical group of defects is known to impact the process.

#81404

Fonseca
Participant

Robert,
The answers that you and Dr. Steve W. sent about violating regression assumptions when dealing with time series really made me think. Could we say the same thing when using a hypothesis test? For example, if I have a process whose output I monitor on a daily or monthly basis and I implement an improvement, do you think I should use an ANOVA test or a Kruskal-Wallis test to check whether the mean or median has shifted? Even if I tested for seasonality and found no evidence of it, the data would remain time data. Is there any variation of ANOVA or of a nonparametric test that is applicable to time data? Please suppose that I cannot use any kind of paired test, OK?
Thanks,
Marcelo

#81427

Robert Butler
Participant

Unless you have a very curious process that permits the simultaneous measurement of everything at once, almost all data is gathered over some interval of time. Time, by itself, is not the issue.  If we ignore seasonal changes for a moment, the concerns that Steve W. and I are addressing have to do with the too frequent recording of data.
A key issue of all data gathering efforts is that of independence of measurements. If the individual measurements are not independent, the data will exhibit significant autocorrelation. Significant autocorrelation usually means that you will compute estimates of the process variability which will be much less than the actual variation of the process. Since variance estimates are central to tests for differences in means, a too-small variance estimate will give false indications of significant mean shifts. Many control methods use the detection of a significant change in the mean as a basis for justifying a change in the process. Changes that are made when none are really needed only serve to increase the overall variability of the process.
If you are going to use process monitoring data to check for significant mean shifts you should check the data for autocorrelation and seasonal changes. If you don’t have access to Minitab or other software that will permit checks of this type you can use the methods in the book by Chatfield that I mentioned in an earlier post.  There are a number of techniques that will allow you to remove the effects of seasonal changes and autocorrelation.  Once this is done, you can test for significant mean shifts in the usual manner.
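To put a rough number on how badly positive autocorrelation can understate variability, the sketch below assumes (this is an assumption for illustration, not something stated in the thread) that the measurements follow a stationary first-order autoregressive, AR(1), process with lag-1 correlation `rho`. For such a series the variance of the sample mean equals the usual sigma²/n multiplied by 1 + 2·Σ(1 − k/n)·rho^k, summed over k = 1 … n−1.

```python
def variance_inflation(rho, n):
    """Ratio of the true variance of the sample mean to sigma^2 / n for AR(1) data."""
    return 1 + 2 * sum((1 - k / n) * rho ** k for k in range(1, n))

n = 100  # hypothetical number of consecutive measurements
for rho in (0.0, 0.3, 0.6, 0.9):
    factor = variance_inflation(rho, n)
    print(f"rho = {rho:.1f}: variance of the mean is {factor:5.2f} x the naive value, "
          f"effective sample size about {n / factor:5.1f}")
```

Under this assumption, a test that treats the 100 readings as independent uses a standard error that is too small by the square root of the printed factor, which is one way the false indications of mean shifts described above can arise.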

#81450

Fonseca
Participant

Robert,
The gathering of independent measurements through so-called rational subgroups is the basis for the Shewhart control charts. I understand it very well because I came from a GE Lighting plant. However, dealing with seasonal or autocorrelated data requires some analysis that I am not used to. I will use Minitab to remove such effects (just a try… I don’t know if I will manage it) in order to use the usual tests.
Thank you,
Marcelo

#81457

BillYbobsmyidol
Participant

Why would we think that coming from GE provides instant credibility on a Six Sigma matter?  The most competent Six Sigma folks I have seen have come from Motorola, Old TI, HoneywellAlliedSignal and Raytheon.  The GE folks I have seen are very good at creating charts showing how great things are.

#81459

James Moffatt
Participant

If there is an autocorrelation or seasonal component to your data, then rational subgroups are not an appropriate data collection method. Rational subgroups will act to obscure useful information.
If there is true seasonality or autocorrelation in your data, I recommend that you don’t try to remove it.
Autocorrelation can be handled by modifying your sampling to ensure that each sample is independent, by masking the autocorrelation with the addition of a dummy variable (a pre-treatment step, not recommended unless you are an expert), or by modifying your model to account for the autocorrelation (see any of the time series books already recommended; I would second the Box-Jenkins recommendation, but it can be heavy going). Be aware that a process feedback loop, auto-correction process, etc. can often produce an effect that looks like autocorrelation. In this instance some measure of the feedback response can be used to correct for the apparent autocorrelation, though of course remember that there is likely to be a time delay on this.
For seasonality, a more effective method (rather than removing the seasonality) would be to determine the factor behind the seasonality and incorporate it in the model. For example, room temperature can affect many processes and is often seasonal; you would be better served by incorporating a measure of temperature into your model than by trying to remove the seasonality after data collection. This can get a bit involved, but in principle, unless you have an accurate descriptor of the seasonality you can only estimate it (just as any regression model is an estimation), and removing this estimate of seasonality from the data will increase the error present in the model. If you have an accurate descriptor of the seasonality, you might as well incorporate it. This will mean that if the factor that controls the seasonality changes unexpectedly your model will reflect this, rather than the change in seasonality turning up as an unexpectedly large error term, which can be awkward to track down.
Finally, you mentioned concern over other hypothesis tests on autocorrelated data. (Seasonal effects should not affect the test, except that there will obviously be an error component in the result of a similar magnitude to the seasonality, assuming you have not incorporated the seasonality into your model.)
To answer this correctly for autocorrelation you need to understand the cycle time of the autocorrelation and how this relates to the sample taken to estimate the median or mean for the test. If the sampling period used to get the median or mean for your hypothesis test is less than the cycle time of the autocorrelation, then the value you obtain for the median or mean will be adversely affected by the autocorrelation. You will then run the risk of obtaining an incorrect result for your test.
Sorry for the long reply. Hope it’s of use.

James

#81465

Fonseca
Participant

James,
Don’t worry about long replies. I like them very much, especially as a mechanical engineer who is trying to deal with statistics…
I can say that I understand the principles of the procedures and theory that you are trying to teach me. The more difficult part will be doing the seasonal adjustments in Minitab. For sure the first results that I achieve won’t be very trustworthy.
Recently I created a worksheet that records the number of defects of a process (a call center process) on a daily and weekly basis. The effect of seasonality is very easy to detect just by visual inspection on the daily basis. When we go to the weekly graph, we see that the seasonality has disappeared (again, just by visual inspection). I wonder if the hypothesis test about the mean/median would give very different results for the two situations. Do you think that if we don’t have seasonality at one data gathering period we don’t need to care about smaller periods? This would be like changing the gathering frequency, wouldn’t it?
Thanks,
Marcelo
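The observation that a day-of-week pattern vanishes in weekly totals can be reproduced with a toy series. Everything below is invented for illustration (the weekday pattern, the week-to-week shifts, the eight-week span); none of it is the call center data from the post.

```python
# Hypothetical daily defect counts: a fixed Monday-to-Sunday pattern plus a
# made-up shift that applies to every day of a given week.
weekday_pattern = [20, 22, 21, 23, 25, 10, 8]
weekly_shift = [0, 1, -1, 2, 0, -2, 1, 0]
daily = [base + shift for shift in weekly_shift for base in weekday_pattern]

# Weekly totals: each total contains every weekday exactly once, so the
# day-of-week pattern contributes the same amount to all of them.
weekly = [sum(daily[i:i + 7]) for i in range(0, len(daily), 7)]

# Day-of-week means recover the seasonal profile from the daily data.
weekday_means = [sum(daily[d::7]) / len(weekly) for d in range(7)]

print("daily observations :", len(daily))
print("weekly observations:", len(weekly), weekly)
print("day-of-week means  :", weekday_means)
```

The weekly series keeps only the week-to-week movement, which is why the seasonality seems to disappear, but it also shrinks 56 observations to 8 and hides whatever is happening within the week.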

#81558

James Moffatt
Participant

In the situation I think you have, I would suspect that you don’t have genuine autocorrelation, especially as it sounds like your sampling rate is actually too low, not too high! If the defects are your primary driver for the investigation, I would recommend that you examine the process at a sampling rate such that you can spot variation in defects within a day. If you have data available I would do a few year/month seasonality comparisons (do call volumes fluctuate across the year, for example?), but your main focus should probably remain within the working week. Without knowing more about the problem and the situation I can’t recommend much; however, call volumes tend to vary over periods of less than half an hour, and I would suggest that you need to sample at this rate at least.
Finally, also remember that there can be many types of seasonality, and your process may be subject to several of them. Seasonality may be present at many levels within your data: yearly (when are the peak times of the year?), monthly (do you have month-end or month-start loading?), and weekly, as discussed in the first paragraph.
Good luck,
James

#81562

Fonseca
Participant

James,
Your advice is very useful. A few weeks ago I didn’t think about seasonality effects the way I do now. Of course I am trying to approach the issue in a very practical way, as I am not a statistician. Therefore I am afraid of making many mistakes related to violation of assumptions and definition of models.
I am trying to learn on my own how to remove seasonality effects from the data using Minitab, and some strange things have happened. Using the Auto Correlation Function I get peaks at lag 7 if I have weekly seasonality. OK. But why, in this situation, do I always get a peak at lag 1? When I took the seasonally adjusted data and ran the Auto Correlation Function, I got statistically significant peaks at all the lags… If all the lags show autocorrelation, does it mean the seasonality has been removed?
Thanks,
Marcelo
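For reference, the lag-7 peak described above is easy to reproduce by hand. The sketch below computes the sample autocorrelation function of a made-up daily series with a pure weekly pattern and flags lags outside the common large-sample bound of ±2/√n. The series is hypothetical, and the bound is a rule of thumb rather than the exact test any particular package applies.

```python
import math

def acf(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    numer = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return numer / denom

# Hypothetical counts: the same Monday-to-Sunday pattern repeated for 8 weeks.
daily = [20, 22, 21, 23, 25, 10, 8] * 8
bound = 2 / math.sqrt(len(daily))

correlogram = {lag: acf(daily, lag) for lag in range(1, 15)}
for lag, r in correlogram.items():
    flag = "*" if abs(r) > bound else ""
    print(f"lag {lag:2d}: {r:+.3f} {flag}")
```

With a perfectly repeating pattern the lag-7 value is 49/56 = 0.875 and the lag-14 value is 0.75. Lag 1 also comes out positive here because neighbouring days mostly sit on the same side of the overall mean, so a lag-1 peak alongside a lag-7 peak is not in itself surprising.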

#81575

James Moffatt
Participant

Marcelo,
I am not sure what you mean when you say “..got the seasonally adjusted data…” have you already done seasonal adjustments?
I cannot really tell from your post what has happened with your data, but here are some thoughts.
Remember that autocorrelation was developed as a method to examine the randomness of data. The autocorrelation result is an indication of correlation between the point under test and a point at any given lag (or window of points, depending on the autocorrelation method). Data that is not random can be predicted. If you have data that varies over a fixed pattern, then once the pattern is established you can predict any point in the sequence from any other point in the sequence. This will show up as strong autocorrelation across all lags. True random data will show no autocorrelation at any lag, unless you have a very small or very large data set of course, where you would expect a coincidental correlation occasionally. For the data to be useful for standard regression methods without having to modify the analysis for time-series components, the autocorrelation should be as low as possible. The exact amount of autocorrelation allowed without adversely affecting your model will depend on a number of factors, including but not limited to the amount of error in the model, the use to which the model will be put (prediction, knowledge extraction, optimisation, etc.), the amount of data available, the degree of correlation between the independent variables and, of course, the risk you are willing to accept.
If you have adjusted for seasonality and you are now getting strong autocorrelation, then I would suggest that you have added seasonality to your data, not taken it away. If on the other hand your unadjusted data is showing strong autocorrelation at all lags, then it suggests that it does indeed have a strong seasonal component. From your comments I suspect the former. If it is the latter, then I would be suspicious of your data, as this would require a very clean data set (low or no error) to show autocorrelation at all lags. If that is the case and the data really is this clean, then you should be able to determine most of the relationships between the independent variables and your dependent variable by eye (call volume vs. defects, for example)!
Also, please make sure that you have interpreted the Minitab data correctly (I assume you have!) Remember that the significance of the autocorrelation is determined from the LBQ stat, not directly from the T value or Correlation coefficient. Data that is random or nearly random will appear to have many frequent spikes on the plots, but will show no significance when the LBQ is considered.
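Assuming "LBQ" here refers to the Ljung-Box Q statistic, the sketch below shows how it pools the autocorrelations at lags 1 through h into a single number, Q = n(n+2)·Σ r_k²/(n−k), which is compared with a chi-squared distribution on h degrees of freedom. Both series are invented: one is seeded pseudo-random noise and the other is a repeating weekly pattern. The 5% critical value for 7 degrees of freedom, 14.07, is a standard table value.

```python
import random

def acf(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    numer = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return numer / denom

def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(x)
    return n * (n + 2) * sum(acf(x, k) ** 2 / (n - k) for k in range(1, max_lag + 1))

random.seed(1)
noise = [random.gauss(0, 1) for _ in range(56)]   # hypothetical random data
weekly = [20, 22, 21, 23, 25, 10, 8] * 8          # hypothetical weekly pattern

CRITICAL_7 = 14.07  # upper 5% point of chi-squared with 7 degrees of freedom
q_noise = ljung_box_q(noise, 7)
q_weekly = ljung_box_q(weekly, 7)
print(f"random series : Q = {q_noise:6.2f}, exceeds critical value: {q_noise > CRITICAL_7}")
print(f"weekly pattern: Q = {q_weekly:6.2f}, exceeds critical value: {q_weekly > CRITICAL_7}")
```

The weekly pattern gives a Q far above the critical value, driven almost entirely by the lag-7 term, whereas individual spikes in a random series usually do not add up to a significant Q.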
Regards,
James

#81576

James Moffatt
Participant

Marcelo,
Just a quick note for reference, be aware that it is very rare for any of the data analysis techniques you have been using to suddenly provide an inspirational picture, or to show clearly something that was previously hidden. The techniques are all for quantifying a relationship, or confirming a hypothesis that would normally be apparent from the raw, or only slightly transformed data. They help remove error from, and allow quantification of, analysis, predictions and understanding, not illuminate hidden truths! Not trying to put you off your data analysis, just warning you that if your data is currently leaving you in the dark with little or no idea of relationships between the variables, then it is unlikely that anything we have discussed will change this! An important rule, keep it simple, go back to the raw data plots, look for understanding from simple charts and plots, once you have that understanding, go back to these methods and use them to confirm your findings. There is no magic wand!
If anyone has some good examples where this is not true, where fundamental truths are revealed by this type of data work, I would be most interested in seeing them! Keep up the learning and trying out, of course!
Cheers,
James

#81584

Fonseca
Participant

James,
Would you mind sending me your email address so that I can show you the data I am dealing with?
Thank you very much,
Marcelo

