# Convert non-normal data

This topic contains 17 replies, has 11 voices, and was last updated by DrSeuss 14 years, 8 months ago.

October 26, 2004 at 8:33 am #37339

Hello Everyone,

I have been trying to run a multiple regression analysis on a set of data coming from a customer questionnaire. Each customer is asked to rate a specific service, such as delivery time, staff friendliness, etc. The rating is based on a scale of 1 – 10, where 1 is worst and 10 is best.

The normality test shows that none of the data sets is normally distributed, so I would like to transform the data before running the multiple regression. I tried the Box-Cox transformation, but even with the calculated lambda the data are still not normal (P is still 0.000).

Anyone have an idea on how to proceed??

Thanks, Sacci

October 26, 2004 at 11:12 am #109719

What is your Y? Since you are using a scale, you really have discrete, not continuous, data. If your Y is also discrete (e.g., they buy from us or not), then you might look at doing a logistic regression.

October 26, 2004 at 12:26 pm #109725

Hello Darth,

My Y is indeed a scale, again from 1 to 10. Running a logistic regression would mean that only a yes/no value is possible as the response.

However, the response can be anything between 1 and 10, so how could that fit a logistic regression?

October 26, 2004 at 12:46 pm #109726

Sacci,

You can use an Ordinal Logistic Regression. It works just like the Binary Logistic Regression except that the output is ordinal (1 to 5, 1 to 10, Likert, etc.).

Best Regards,

Bob J

October 26, 2004 at 1:09 pm #109727

Robert Butler (Participant, @rbutler)

You have a couple of options. Which one you can use will depend on the sophistication of your particular regression package.

First a point of order. Neither your Y’s nor your X’s need to be normally distributed. The need for normality applies to the residuals and this need arises because the tests for significance of the model terms assume residual normality. See page 23, point #3 of Applied Regression Analysis, 2nd Edition, Draper and Smith for further discussion.

If you have only a basic regression package, you can treat the 1-10 scale as very coarse measurements and run a regression against them. If you do, your residual plots will look “odd”: if there is no lack of fit, they will look like slanted bands of numbers rather than a shotgun blast. There are a variety of problems with this approach, but if you don’t have anything else it is better than nothing. A key point to remember if you do it this way is that your prediction error will probably be such that you cannot distinguish between a rating of 1 and a rating of 2. Instead, the error will probably only allow you to group your ratings into two or three categories (good and bad, or good, neutral, and bad).
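To see where the slanted bands come from, here is a minimal sketch using only the Python standard library. The data are made up for illustration, and a single predictor stands in for the multiple regression discussed in the thread. Because each residual equals the observed rating minus the fitted value, all points that share a rating fall on a line of slope -1 when residuals are plotted against fitted values.

```python
# Hypothetical data: x is some predictor, y is a 1-10 integer rating.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2, 2, 3, 3, 3, 5, 5, 6, 8, 8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for a single predictor.
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Group (fitted, residual) pairs by rating level. Within one level,
# residual = level - fitted, so the points lie on a line of slope -1.
bands = {}
for yi, fi, ri in zip(y, fitted, residuals):
    bands.setdefault(yi, []).append((round(fi, 2), round(ri, 2)))

for level in sorted(bands):
    print(level, bands[level])
```

Each printed row is one band: as the fitted value rises within a rating level, the residual falls by the same amount.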

Since your ratings are ordered, you could use the cumulative logit model, the adjacent categories model, or the continuation ratio model. Of the three, the cumulative logit has the most applicability, and in some regression packages the machine will automatically switch to a cumulative logit model when it detects more than two levels in the Y variable. With a cumulative logit, the analysis proceeds the same way it does for binary data and the output is interpreted in a similar manner.

October 26, 2004 at 1:11 pm #109728

Yes, Bob is correct. I recommend you use (in Minitab):

Stat > Regression > Ordinal Logistic Regression…

And remember, you have to check the normality of the residuals.
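As a rough illustration of what the cumulative logit (ordinal logistic) model mentioned above computes, the sketch below turns a set of thresholds and one slope into category probabilities. The parameterisation P(Y <= j | x) = logistic(threshold_j - beta * x) is an assumption, since packages differ in sign convention, and the threshold and slope values are invented rather than fitted to any data.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(thresholds, beta, x):
    """Return P(Y = k) for k = 1..K under a cumulative logit model.

    Assumed parameterisation (packages differ in sign convention):
        P(Y <= j | x) = logistic(thresholds[j - 1] - beta * x)
    with strictly increasing thresholds, one fewer than the categories.
    """
    cumulative = [logistic(t - beta * x) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        # Each category probability is a difference of cumulative ones.
        probs.append(c - previous)
        previous = c
    return probs


# Hypothetical values, not fitted to any data: three ordered categories.
thresholds = [-1.0, 1.0]
beta = 0.8
for x in (0.0, 2.0):
    print(x, [round(p, 3) for p in category_probabilities(thresholds, beta, x)])
```

With a positive slope under this convention, larger values of x shift probability toward the higher categories, which is the sense in which the output is read much like a binary logistic model.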

November 1, 2004 at 2:24 pm #110074

Jackson Low (Participant, @Jackson-Low)

Hello everyone,

I am doing a hypothesis test. Before running the test, we need to check the data for normality. When I did, I ran into the same situation as Sacci.

The normality test shows that none of the data sets is normally distributed. I transformed the data using the Box-Cox transformation, but even with the calculated optimal lambda the data are still not normal (P is still 0.000).

Anyone have an idea on how to proceed??

November 1, 2004 at 2:38 pm #110077

Robert Butler (Participant, @rbutler)

Both the t-test and ANOVA are robust with respect to non-normality. The issue in both cases is variance equivalence. In the case of the t-test there are options which allow corrections for unequal sample sizes and unequal variance.
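The unequal-variance option is commonly known as Welch's t-test. The sketch below, using only the standard library and invented samples, computes the Welch statistic and the Welch-Satterthwaite degrees of freedom. Turning these into a p-value needs a t-distribution table or a statistics package, which is left out here.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical samples with different sizes and spreads.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10, 12]
t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 3), round(dof, 2))
```

The degrees of freedom always fall between the smaller of the two (n - 1) values and the pooled value n_a + n_b - 2, which is how the correction accounts for unequal variances and sample sizes.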

For two populations you can also use the Wilcoxon-Mann-Whitney test. The test does not assume normality. It checks the null hypothesis that the distribution of an ordinally scaled measure is the same in two independently sampled populations. Like the t-test, it tests against the alternative hypothesis that there is a location difference between the two populations. This test is appropriate for any situation where the t-test could be used.

November 1, 2004 at 2:50 pm #110078

You should not try to force-fit the data to a normal distribution. The distribution is inherent to the process, so if it is not normal, you should use your Six Sigma knowledge and apply it to the actual distribution of your data. Six Sigma is not limited to normal distributions.

November 1, 2004 at 3:11 pm #110079

Jackson Low (Participant, @Jackson-Low)

Dear Robert Butler,

Is the Wilcoxon-Mann-Whitney test available in Minitab? I have checked but can’t find it. Could you give me instructions? Thanks.

November 1, 2004 at 3:24 pm #110080

You can also try Mood’s Median test in Minitab, as well as the Johnson Transformation if you have Minitab 14. You also might want to confirm why the data is non-normal. If it is due to what data you collected and how you collected it, then you need to deal with that. If it is the reality of the process, then either of the two options mentioned above might help.

November 1, 2004 at 3:38 pm #110083

Robert Butler (Participant, @rbutler)

I don’t have access to Minitab so I don’t know if it has the test. Given what Minitab does I would be surprised if it didn’t. The test is also known as the Wilcoxon two-sample rank test and the Mann-Whitney test. A description of how to do this manually would be too long for this forum; however, if you have access to any books on non-parametric methods you should find it listed in the index under one of the three titles. Two references that give a good description of how to build and run the test are:

Statistical Theory and Methodology in Science and Engineering – Brownlee

Practical Non-parametric Statistics – Conover

November 1, 2004 at 3:39 pm #110084

Yes, Minitab does perform the 1-sample Wilcoxon and 1-sample sign tests. Just try this:

Go to: Minitab > Stat > Nonparametrics, and you can pick either 1-Sample Sign or 1-Sample Wilcoxon.

You

November 1, 2004 at 3:52 pm #110085

Mann-Whitney test in Minitab.

Go to: Minitab > Stat > Nonparametrics > Mann-Whitney

Ho: M1 = M2

Ha: M1 not equal to M2

Remember, this is a test of medians.

Also, the Mann-Whitney test is a nonparametric alternative to the two-sample t-test with pooled sample variances.
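For readers without Minitab, here is a minimal standard-library sketch of the Mann-Whitney U statistic, computed by direct pairwise comparison with ties counted as one half. The p-value uses the large-sample normal approximation without tie or continuity correction, which is a simplifying assumption; with heavily tied survey ratings a statistics package that applies the tie correction is preferable. The two samples are invented.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) by direct pairwise comparison; ties count 0.5."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


def normal_approx_p(u, n_a, n_b):
    """Two-sided p-value from the large-sample normal approximation.

    No tie or continuity correction is applied (simplifying assumption).
    """
    mean_u = n_a * n_b / 2.0
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical ratings from two independently sampled groups.
group_a = [7, 8, 9, 9, 10]
group_b = [3, 4, 5, 6, 7]
u_a, u_b = mann_whitney_u(group_a, group_b)
p_value = normal_approx_p(u_a, len(group_a), len(group_b))
print(u_a, u_b, round(p_value, 4))
```

U_a counts how often a value from the first group exceeds a value from the second; a U close to n_a * n_b / 2 is what the null hypothesis of no location difference predicts.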

Hope it helps.

November 2, 2004 at 5:12 am #110108

DrSeuss (Participant, @DrSeuss)

Sacci,

Darth was trying to point you in the direction of discrete data. Just because you have a 1-10 scale doesn’t mean you have continuous data. Look at your data: are your values just 1, 2, 3, … or 10? Or do you have response values such as 1.1, 2.76, or 8.23 for each question? Probably not. You have discrete data. Also, since this is survey data, you must evaluate each question independently. By the way, what was your response rate for the questionnaire? If it is less than 85%, are you going to adjust for non-response bias? Lotsa stuff to think about.

If you do consider using a nominal or ordinal logistic regression, get some help, because the interpretation of the results can be tricky. However, this tool could give you event probabilities for each categorical bucket from 1-10 or some other grouping. A lot of questions have a 1-10 response scale, but the originator is really interested in the top two box scores or the bottom two or three box scores. This scenario would take your 1-10 responses and segment them into dissatisfied, neutral, or delighted, i.e. bottom boxes, middle boxes, and top two boxes. This segmentation would make the logistic regression analysis easier. Just some things to think about.
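The segmentation described above can be sketched in a few lines of Python. The cut-offs used here (bottom three boxes, top two boxes) and the sample responses are assumptions for illustration; the right grouping depends on what the survey owner actually cares about.

```python
from collections import Counter


def segment(rating, bottom_max=3, top_min=9):
    """Map a 1-10 rating to a segment label.

    The default cut-offs (bottom three boxes, top two boxes) are
    assumptions; adjust them to the grouping of interest.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating out of range: {rating}")
    if rating <= bottom_max:
        return "dissatisfied"
    if rating >= top_min:
        return "delighted"
    return "neutral"


# Hypothetical responses to one survey question.
responses = [10, 9, 7, 3, 8, 10, 2, 6, 9, 5]
counts = Counter(segment(r) for r in responses)
print(counts)
```

The resulting three-level variable is still ordinal, so it can serve as the response in an ordinal logistic regression with far fewer categories to estimate.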

Finally, a word about normal distributions: normal data can be expected (not guaranteed) if your variable has a mean or target value and the process allows variation about the target value. Your survey data does not fit this scenario, hence you should not expect it to be normally distributed.

November 2, 2004 at 10:29 am #110111

Hi DrSeuss,

I just have an uncomfortable feeling of having missed that in a similar situation. Could you point me towards some description of how to correct for non-respondent bias?

Regards and thanks

Sandor

November 2, 2004 at 8:51 pm #110148

TucsonTom (Member, @TucsonTom)

If you have access to SPSS you should consider using their ordinal regression function. Although Minitab performs ordinal logistic regression too, SPSS has a neat feature in that it performs a test for equal slopes of the regression lines.

November 5, 2004 at 4:58 am #110314

DrSeuss (Participant, @DrSeuss)

Sandor,

Most people do not understand the non-response bias correction procedure, which adjusts your survey results to more accurately reflect the true population response. Here is the link I used to find out more about this technique: http://nces.ed.gov/statprog/2002/std4_4.asp

The basic formula: estimated bias = Delta × (1 − R), where R is your initial response rate and Delta is the difference Xbar(respondents) − Xbar(non-respondents). The adjusted estimate is then Xbar(pop) = Xbar(respondents) − Delta × (1 − R). Only use this adjustment when the response rate R to the initial offering is less than 85%. Select a sample of non-respondents and send them another offering. Pay them if necessary to get them to participate. Note, the non-response rate must be at least 75%. Hopefully, this will help you get a better understanding of this concept.
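One common reading of this adjustment treats Delta × (1 − R) as the estimated bias of the respondent mean, so that subtracting it gives the same result as weighting respondents and non-respondents by R and 1 − R. The sketch below works through that arithmetic; the function name and the numbers are hypothetical, and the non-respondent mean is assumed to come from the follow-up sample described above.

```python
def adjust_for_nonresponse(mean_respondents, mean_nonrespondents, response_rate):
    """Return (adjusted_mean, bias) for a simple non-response adjustment.

    bias = Delta * (1 - R), with Delta = respondent mean minus
    non-respondent mean; the adjusted mean subtracts that bias.
    """
    if not 0.0 < response_rate <= 1.0:
        raise ValueError("response_rate must be in (0, 1]")
    delta = mean_respondents - mean_nonrespondents
    bias = delta * (1.0 - response_rate)
    return mean_respondents - bias, bias


# Hypothetical numbers: 60% initial response rate, respondents average 8.0,
# followed-up non-respondents average 6.0 on the 1-10 scale.
adjusted, bias = adjust_for_nonresponse(8.0, 6.0, 0.60)
print(round(adjusted, 2), round(bias, 2))
```

The adjusted value equals R × Xbar(respondents) + (1 − R) × Xbar(non-respondents), so the correction shrinks to zero as the response rate approaches 100% or as the two groups answer alike.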
