Help needed with selection of Regression Model for SDLC

Six Sigma – iSixSigma Forums, Operations, IT

Viewing 16 posts - 1 through 16 (of 16 total)


    Hi All,

    I have a data set as follows-
    Effort Variance   Review Type   Req Complexity
    14.5              Peer          Simple
    24.6              Peer          Moderate
    45.3              Peer          Complex
    8                 SME           Simple
    20.3              SME           Moderate
    40                SME           Complex
    16                Peer          Simple

    The response variable is ‘Effort Variance’ (which is continuous) and the predictors are ‘Attributes’ with 2 to 3 categories. I am struggling to figure out the right regression model for this type of data set (is there any?). If an appropriate regression model does not exist, then what is the best way to run a regression on this? (Happy to use dummy variables.)

    I am using Minitab 16, any help at the earliest will be much appreciated.




    Are your predictor variables truly attributes (red or green for example) or are they merely a selected setting of a variable with a continuous range?



    Hi, they are attributes and discrete in nature….

    I am attaching the sample data in Excel format. Please note that the Effort Variance values are based on the difference between actual and planned efforts (not shown in the sample)…I want to include the competency data (attributes) to see if there is any relation.

    I am using Minitab 16 and I am not into DOE.



    Not sure if you are able to see the attachment, but this is my third attempt! [file name=Sample.xls size=13824][/file]


    Robert Butler

    Given that they are attributes and they all look like the sample you provided then you can do the following:

    The two level attributes can be coded -1 and 1 or 0 and 1. If you code 0 and 1 what you are doing is using the attribute with the code of 0 as the reference. Depending on what it is that you want to get out of the regression this may be a good or a bad choice.

    As for the three level attributes if all of them are like the provided sample then they are ordinal and you can code the levels -1,0,1. If some of them are nominal and not ordinal then you will have to set up dummy variables for those that are nominal and run the analysis with a mix of dummy variables and coded variables.
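The coding described above can be sketched on the sample rows from the first post. This is only an illustration: which level gets -1, the choice of "Simple" as the dummy reference, and the helper name `dummy_columns` are assumptions made here, not anything Minitab prescribes.

```python
# Sample rows from the first post: (effort variance, review type, requirement complexity)
rows = [
    (14.5, "Peer", "Simple"),
    (24.6, "Peer", "Moderate"),
    (45.3, "Peer", "Complex"),
    (8.0, "SME", "Simple"),
    (20.3, "SME", "Moderate"),
    (40.0, "SME", "Complex"),
    (16.0, "Peer", "Simple"),
]

# Two-level attribute coded -1/1 (which level gets -1 is an arbitrary choice here).
REVIEW_CODE = {"Peer": -1, "SME": 1}
# Three-level ordinal attribute coded -1/0/1 in order of increasing complexity.
COMPLEXITY_CODE = {"Simple": -1, "Moderate": 0, "Complex": 1}


def dummy_columns(value, levels, reference):
    """0/1 indicator columns for a nominal attribute, omitting the reference level."""
    return [1 if value == level else 0 for level in levels if level != reference]


# Coded version of the data: ordinal/two-level attributes become numeric columns.
coded = [(y, REVIEW_CODE[r], COMPLEXITY_CODE[c]) for y, r, c in rows]
for row in coded:
    print(row)

# If complexity were nominal rather than ordinal, it would need dummy variables instead.
levels = ["Simple", "Moderate", "Complex"]
dummies = [dummy_columns(c, levels, reference="Simple") for _, _, c in rows]
print(dummies)
```

With the 0/1 dummies, each coefficient is read as a shift relative to the reference level, which is why the choice of reference matters for interpretation.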

    The model terms you could entertain would be main effects and linear x linear interactions. You could also look at curvilinear terms for the ordinal variables with three levels but if you do you will face the problem of term interpretation (if the residual analysis indicates the need for such a term some people will test the curvilinear term and, if it resolves the problem, they will note this fact and make it a point to explain that the existence of the term in the model is for purposes of fitting only).

    Once you have everything coded you will have to run an analysis of the X matrix to make sure your variables exhibit enough independence from one another so that they can be included in the regression. For a first look I would recommend just looking at an X matrix comprised of main effect and two way linear x linear interactions.

    The best choice for this would be to run an eigenvalue/condition index assessment of the X matrix. Since, as far as I know, Minitab does not have this capability you will have to go with just running a VIF check. If the terms meet the VIF criteria for inclusion then you can get on with the analysis. If they don’t you will use the results of the VIF check to identify those terms that need to be dropped.
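For intuition about what the VIF check measures, here is a minimal sketch for the special case of exactly two predictors, where the VIF of each predictor is 1 / (1 - r²) and r is the correlation between them. It uses the seven sample rows with the -1/1 and -1/0/1 coding (the coding direction is an assumption). With more than two terms, each VIF comes from regressing that term on all the others, which this sketch does not do.

```python
from math import sqrt

# Coded predictors for the seven sample rows (Peer=-1, SME=1; Simple=-1, Moderate=0, Complex=1).
review = [-1, -1, -1, 1, 1, 1, -1]
complexity = [-1, 0, 1, -1, 0, 1, -1]


def pearson_r(a, b):
    """Pearson correlation between two equal-length columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / sqrt(s_aa * s_bb)


def vif_two_predictors(a, b):
    """VIF shared by both predictors when the model has only these two terms."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)


vif = vif_two_predictors(review, complexity)
print(f"r = {pearson_r(review, complexity):.3f}, VIF = {vif:.3f}")
```

A VIF near 1 means the two columns are close to independent. Commonly quoted rules of thumb start to worry at values somewhere around 5 to 10, but the cutoff is a judgment call.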

    Once you have identified the terms that can be included in the model you should run both backward elimination and stepwise (forward selection with replacement) and check to see if the two methods converge to the same model. If they do then you will need to run the usual regression analysis (residual check/plot, lof, etc.) to test model adequacy.

    If they don’t converge to the same model you will still have to run the usual regression analysis and you will have to spend some time examining the two equations in order to decide which one is more meaningful. To this end, the results of the regression analysis may help with respect to making decisions concerning model choice.

    If the residual patterns indicate problems (missed linear or quadratic terms, need for variable transformation, etc.) then you will have to spend more time investigating and resolving these issues.



    Yeah – what Robert said.



    Thanks Robert.

    This was a very helpful reply! I have a strong feeling that multicollinearity exists! I will have to look at the data again; however, would you know if we can perform nominal logistic regression? My understanding is that nominal (or binary or ordinal) regressions (supported by Minitab) can be performed when the response variable (Y) is nominal/binary/ordinal. Would you know?


    Robert Butler

    If the response variable is nominal or binary you could run a logistic regression, however, if you do this you need to understand that your output will not be in the form of a change in X resulting in some kind of absolute change in Y. What you get with logistic regression is odds ratios.

    So, let’s say you had some Y where 1 is the reference and you ran a logistic regression against X1 and X2, where X1 was continuous and X2 was a binary yes/no. Let’s say the coefficient for X1 is .074 and the coefficient for X2 is .788, where the reference for X2 was “yes” (=1).

    For X1 the coefficient of .074 gives an odds ratio of exp(.074) = 1.08; thus, for each unit increase in X1, the odds of Y occurring (i.e. Y=1) increase by 8%. In a similar vein, the odds ratio for X2 is exp(.788) = 2.2, which means that the odds of Y occurring when X2 = yes are a little more than twice what they would be if X2 = no. For the case of a nominal X2 with more than two levels, depending on how your statistics package is built, you will either have to build dummy variables for X2 or tell the program which of the levels of X2 is going to be treated as the reference.
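The arithmetic behind those odds ratios is just exponentiating the coefficients. A quick check, using the two example coefficients from the post (the variable names are made up for illustration):

```python
from math import exp

# Example logistic regression coefficients quoted in the post.
coef_x1 = 0.074  # continuous predictor
coef_x2 = 0.788  # binary yes/no predictor

# An odds ratio is the exponential of the coefficient.
odds_ratio_x1 = exp(coef_x1)
odds_ratio_x2 = exp(coef_x2)

# Percentage change in the odds of Y=1 per unit increase in X1.
pct_change_x1 = (odds_ratio_x1 - 1) * 100

print(f"X1 odds ratio: {odds_ratio_x1:.2f} ({pct_change_x1:.0f}% increase in odds per unit)")
print(f"X2 odds ratio: {odds_ratio_x2:.1f}")
```

Note that these are multiplicative changes in the odds, not additive changes in Y, which is the point being made about interpretation.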

    All of this is a very long winded way of saying that, unless this is some kind of medical analysis, logistic regression probably won’t tell you what you really want to know.

    In the case of Y being ordinal, as in a Likert scale, you can use regular regression for the analysis.



    Brill Rob!

    Thank you very much. Yup, I don’t think Minitab does eigenvalues; however, I will follow your advice and see how I get on!

    Well, we are still on data gathering and this is the sample data that I have got from the team. So may take some time to get cracking with the real-time data!

    Many Thanks for your help!


    David Lengacher

    You can download R for free and using this tutorial, you can construct Generalized Linear Models (GLMs) in about 45 seconds. This set of models is used when your Y is not dist normally, or bound to positive values…something we often forget is a condition for linear regression. Try graphing your Y to ensure it’s not poisson dist. If it is, you may want to use a GLM with a log link function.

    Assessing the model fit is not as straightforward (no R-square or standard error), but with a little reading you can find the equivalent things to look at for model and parameter fit.


    Robert Butler

    The statement “This set of models is used when your Y is not dist normally, or bound to positive values…something we often forget is a condition for linear regression. Try graphing your Y to ensure it’s not poisson dist. If it is, you may want to use a GLM with a log link function.” is in error.

    A check of any basic book on linear regression (Draper and Smith – Applied Regression Analysis 1st Edition p. 17, for example) shows that there are no special restrictions on Y either in terms of the underlying distribution or in terms of positive or negative values. The same is true of the X’s.

    Basic linear regression makes no assumptions concerning probability distributions. All it involves is a number of specified algebraic calculations. When it comes to running tests of term significance the only distribution consideration is that of the residuals of the regression – these must be normally distributed and uncorrelated.


    David Lengacher

    There are several references stating linear regression is inappropriate in situations where Y is not normally distributed. Here is one from a little school called MIT. Just search for the phrase “dependent variable not normally distributed” and you’ll find dozens.

    The standard multiple linear regression model is inappropriate to model this data for the following reasons:


    2. The dependent variable is not normally distributed.

    Also from SAS:

    Thus, you can use the GAM procedure when you have multiple independent variables whose effect you want to model nonparametrically, or when the dependent variable is not normally distributed.



    Thanks Interdisc – I surely have a lot of reading to do!

    I had earlier thought that R-square and Std Error are the parameters to check for a model fit; however, I do understand that additional factors like Mallows’ Cp, etc. do make an impact!


    Robert Butler

    I took some time to check the references cited in the above post in defense of the statement “This set of models is used when your Y is not dist normally, or bound to positive values…something we often forget is a condition for linear regression. Try graphing your Y to ensure it’s not poisson dist. If it is, you may want to use a GLM with a log link function.”

    In my initial response to that post I provided a reference for the proof that this statement was in error. Because I had done this I assumed the provided citations and links in the rebuttal were to papers or proofs that would demonstrate that, since the time of the publication of the book I cited, things had changed and that it had since been demonstrated that Y had to be normally distributed before linear regression methods could be employed.

    I agree that the author of the paper cited is from MIT and does say in his paper that Y must be normal. I also agree that the SAS manual cited as well as the other references noted express similar sentiments. I further agree that statements such as this can be found in peer reviewed papers in science from almost every institution in the world as well as in various and sundry textbooks – unfortunately, just saying this does not make it so and in every instance, the claim that Y must be normally distributed is wrong.

    If we revisit Applied Regression Analysis by Draper and Smith we will find in both the first and the second edition a very concise discussion of linear regression (pages 8-24 in the second edition). Basically the authors walk through the mechanics of linear regression illustrating with both equations and diagrams the algebraic concept of least squares fitting.

    At the end of the algebra they state, “Up to this point we have made no assumptions at all that involve probability distributions. A number of specified algebraic calculations have been made and that is all…[We now make the following basic assumptions about the error term of the model].”

    1. The errors are normally distributed
    2. They have a mean of zero
    3. They have a constant variance
    4. They are statistically independent

    There are no other assumptions. This same proof with a more mathematical level of detail can be found in Kendall and Stuart’s book The Advanced Theory of Statistics Volume 2. The assumption of normality of the residuals is needed to validate the use of t and F distributions as exact distributions for construction of test statistics and confidence intervals.
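In symbols, the four assumptions can be written compactly as below. The notation (y_i for the response, x_ij for the predictors, β_j for the coefficients, ε_i for the errors, n runs, p terms) is introduced here for illustration and is not copied from the books cited.

```latex
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
\qquad
\varepsilon_i \sim N(0, \sigma^2) \ \text{independently for } i = 1, \dots, n.
```

Every distributional statement is about the errors ε_i. Nothing in this formulation constrains the marginal distribution of y_i or of the x_ij.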

    Thus the reason you can’t use linear regression for some types of outcomes, such as binary responses, is not because the distribution of said variable is not normal but because if you used linear regression methods on things like 0,1 data the model residuals will not meet, and could not be made to meet via transforms of the X’s or the Y’s or both, the above criteria. This, in turn means you can’t use the t and F statistics to test model terms.

    If you would like a practical example of why Y does not need to be normal when using linear regression, consider the following data set. (Note: in the preview screen the numbers look like they are in columns; they may not be in the final submit. If they aren’t, the columns are exp, x1, x2, x3, x4, y2, and yuni1; everything except the experiment number, y2, and yuni1 is either 1 or -1.) Normality tests of the y2 response give the following:

    Shapiro-Wilk: Pr < W = .0064
    Kolmogorov-Smirnov: Pr > D = .0142
    Cramer-von Mises: Pr > W-Sq < .005
    Anderson-Darling: Pr > A-Sq < .005

    Exp   x1   x2   x3   x4        y2   yuni1

      1   -1   -1   -1   -1    3.2089       1
      2    1   -1   -1   -1    3.2348       1
      3   -1    1   -1   -1    0.8059       1
      4    1    1   -1   -1    1.0059       1
      5   -1   -1    1   -1    2.2293       0
      6    1   -1    1   -1    1.8035       0
      7   -1    1    1   -1    0.5818       1
      8    1    1    1   -1    1.8207       0
      9   -1   -1   -1    1   11.2089       1
     10    1   -1   -1    1   11.2348       1
     11   -1    1   -1    1    8.8059       1
     12    1    1   -1    1    9.0059       1
     13   -1   -1    1    1   10.2293       0
     14    1   -1    1    1    9.8035       0
     15   -1    1    1    1    8.5818       1
     16    1    1    1    1    9.8207       0

    The data set is contrived but its choice is based on personal experience from setting up and running experimental designs. From time to time non-normal continuous Y responses like y2 result from efforts of this type. If you run a backward stepwise regression on the main effects and two-way interactions of x1-x4 with a cut point of alpha = .05 you will get a final model where y2 = fn(x2, x3, x4, x1*x2, x2*x3). A check of the residuals will show they meet the requirements for normal distribution, and if you run the usual analysis of the residuals (plots, etc.) you will see nothing of any note. The residual statistics are:

    Shapiro-Wilk: Pr < W = .58
    Kolmogorov-Smirnov: Pr > D > .15
    Cramer-von Mises: Pr > W-Sq > .25
    Anderson-Darling: Pr > A-Sq > .25
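For anyone who wants to reproduce the y2 fit without a statistics package, here is a minimal sketch. It relies on the fact that the ±1 columns of a full two-level factorial are mutually orthogonal, so each least-squares coefficient is simply sum(x*y)/n. It fits the five-term final model directly and does not reproduce the backward elimination, the significance tests, or the normality tests (those would need a statistics library). The helper names `column` and `coefficient` are made up for this illustration.

```python
from itertools import product

# y2 response in the posted run order (experiments 1-16).
y2 = [3.2089, 3.2348, 0.8059, 1.0059, 2.2293, 1.8035, 0.5818, 1.8207,
      11.2089, 11.2348, 8.8059, 9.0059, 10.2293, 9.8035, 8.5818, 9.8207]

# Rebuild the 2^4 design in the posted order: x1 changes fastest, x4 slowest.
design = [dict(x1=a, x2=b, x3=c, x4=d) for d, c, b, a in product((-1, 1), repeat=4)]


def column(term):
    """Values of a model term such as 'x2' or 'x1*x2' for every run."""
    values = []
    for run in design:
        value = 1
        for name in term.split("*"):
            value *= run[name]
        values.append(value)
    return values


def coefficient(term, y):
    """Least-squares coefficient; valid only because the design columns are orthogonal."""
    return sum(x * v for x, v in zip(column(term), y)) / len(y)


# Terms of the final model reported in the post.
terms = ["x2", "x3", "x4", "x1*x2", "x2*x3"]
intercept = sum(y2) / len(y2)
coefs = {t: coefficient(t, y2) for t in terms}

# Fitted values and residuals for the five-term model.
fitted = [intercept + sum(coefs[t] * column(t)[i] for t in terms) for i in range(len(y2))]
residuals = [obs - fit for obs, fit in zip(y2, fitted)]

print(f"intercept = {intercept:.5f}")
for t in terms:
    print(f"{t:6s} {coefs[t]: .5f}")
print("residuals:", [round(r, 4) for r in residuals])
```

The residuals printed at the end are what the normality tests and residual plots would be run on.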

    If you repeat the exercise with yuni1 as the response you will also get a linear regression model with significant terms. However, when you analyze the residuals you will see they fail to meet the requirements listed above. Consequently, the linear regression model for the binary data cannot be viewed as correct.
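A rough way to see the problem with the binary response is sketched below. It fits all four main effects and all six two-way interactions to yuni1 with no term elimination (so it is not the same model a stepwise run would settle on), again using the orthogonality of the ±1 design columns. The helper name `column` is made up for this illustration.

```python
from itertools import product

# Binary response in the posted run order (experiments 1-16).
yuni1 = [1, 1, 1, 1, 0, 0, 1, 0,
         1, 1, 1, 1, 0, 0, 1, 0]

names = ["x1", "x2", "x3", "x4"]
# Rebuild the 2^4 design in the posted order: x1 changes fastest, x4 slowest.
design = [dict(zip(names, (a, b, c, d))) for d, c, b, a in product((-1, 1), repeat=4)]

# All main effects plus all two-way interactions.
terms = names + [f"{p}*{q}" for i, p in enumerate(names) for q in names[i + 1:]]


def column(term):
    """Values of a model term such as 'x3' or 'x1*x2' for every run."""
    values = []
    for run in design:
        value = 1
        for name in term.split("*"):
            value *= run[name]
        values.append(value)
    return values


n = len(yuni1)
intercept = sum(yuni1) / n
# Orthogonal +/-1 columns: each least-squares coefficient is sum(x*y)/n.
coefs = {t: sum(x * y for x, y in zip(column(t), yuni1)) / n for t in terms}

fitted = [intercept + sum(coefs[t] * column(t)[i] for t in terms) for i in range(n)]
residuals = [y - f for y, f in zip(yuni1, fitted)]

distinct_residuals = sorted(set(residuals))
out_of_range = [f for f in fitted if f < 0 or f > 1]

print("distinct residual values:", distinct_residuals)
print("fitted values outside [0, 1]:", out_of_range)
```

For this fit the residuals take only two distinct values and several fitted values fall outside the 0 to 1 range, neither of which is compatible with normally distributed errors around a sensible prediction.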

    Since the issue of testing model adequacy has also been mentioned it should be noted that when examining the results of a linear regression the proper method of assessment is that of residual analysis. I don’t know of a single book on linear regression that recommends anything less. The best short summation of this issue that I know of can be found in Regression Analysis by Example by Chatterjee and Price when they state, “It is very important to investigate the structure of the residuals and the data pattern through graphs. A large value of R² or a significant t statistic does not insure that the data has been fitted well.”

    If you are interested in learning more about linear regression methods, their issues, and assessment methods, I would recommend the Chatterjee and Price book as an excellent first choice.



    Thank-you Rob.

    This was a very detailed and well-explained piece! I thought I knew regression, but it turns out there is more to it than I had thought! I will definitely look up the Chatterjee and Price book. I believe you are referring to ‘Regression Analysis by Example’? Please can you confirm?

    Many Thanks.


    Robert Butler

    Yes, that’s the book.

    I already mentioned this in the second post but, in light of some of your earlier comments, it bears repeating. The core of linear regression model assessment is residual analysis and the core of that effort is assessment of plots of the residuals against predicted values, independent X values, and anything else that makes sense (for example, if your ability to randomize was limited and you had to either block the experiments or were forced into some kind of non-random run order over time you would want to look at the residuals as a function of these things). You would also want to look at plots of the residuals on normal probability plots as well as in histogram form.

    What you cannot do is assess a linear regression only on the basis of a single or even a group of summary statistics such as R-square, RMSE values, Mallows Cp, etc. These things are necessary but they are not sufficient. The reason they are not sufficient is because they cannot provide any assessment of trends in the residuals and they cannot provide you with an assessment of just how non-normal the residuals might be.

    The last item is particularly important in light of the issue of residual normality. Because of the sensitivity of the various normality tests to deviations from normality it is fairly easy to have one or more of these tests declare significant non-normality if you have a sufficient amount of data. Because there is a very large gray area in the continuum of data from perfectly normal to binary, the big question is this: does the detected deviation matter? …and that is where the plots come into play.

    With the plots you can quickly identify the problems with the residuals (either trending or “odd” data points or “odd” groups of data points). If the issue is residual plot trending then you will have to investigate this using methods outlined in any thorough discussion of residual analysis. If the issue is confined to the appearance of “odd” residuals in the plot then you identify these points and the data points associated with them.

    Once identified you set up your code to exclude the data in question and rerun the analysis to see if these data points are influential. What you do after that will be determined by what you find. If your investigation does not uncover influential data points, if the plots do not indicate the existence of such things as trends or clusters, and if the histogram and normal probability plots of the residual distribution look “reasonably normal” (yes, this is a judgment call rendered after you have done all of the above) then, since the t-test is reasonably robust with respect to non-normality (See The Design and Analysis of Industrial Experiments 2nd Edition – Davies pp. 48-56 for a discussion of this issue), you will probably be justified in concluding that your model is adequate…and then you can get on with validation efforts.
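The "exclude and rerun" step can be sketched with a one-predictor fit. The x and y values below are hypothetical numbers invented for this illustration (they are not from the thread), with the last point deliberately made odd, and `fit_line` is a made-up helper.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Hypothetical data: roughly y = 2x, except for the last point.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 25.0]

# Index of the point flagged as "odd" in the residual plots (assumed here).
suspect = {7}

slope_all, intercept_all = fit_line(x, y)

# Refit with the flagged point excluded.
kept = [i for i in range(len(x)) if i not in suspect]
slope_kept, intercept_kept = fit_line([x[i] for i in kept], [y[i] for i in kept])

print(f"all points:     slope = {slope_all:.3f}, intercept = {intercept_all:.3f}")
print(f"suspect removed: slope = {slope_kept:.3f}, intercept = {intercept_kept:.3f}")
```

If the coefficients move a lot between the two fits, the flagged point is influential and deserves investigation. Whether to keep it is a subject-matter decision, not something the refit decides for you.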

    If this sounds like a lot of work – it is…and it is the difference between “running a regression”, which is what any reasonably programmed computer does, and “running a regression analysis”, which is what anyone attempting to extract information from data should be doing.
