Box-Cox and Johnson Transformation

This topic contains 26 replies, has 3 voices, and was last updated by  Chris Seider 9 months ago.

    I want to understand the difference between the Box-Cox and Johnson transformations, and what each one does if I apply it to a capability analysis.


    What can I understand from the two pics?


    the second one


    And what does it look like without transformational efforts?



    The first pic is after transformation and the second pic is before transformation.


    You said you ran a Box-Cox transform and a Johnson transform, which would suggest you need to show three graphs – before, Box-Cox, and Johnson.

    The raw plot – the second graph, gives the impression that you have not been given all of the data from the process. You have an LCL and your process is slammed right up against it. Unless the LCL is some actual physical limit the plot would suggest you are dealing with lot selection and no one is reporting results below the LCL. If this is the case then you have other things to consider before looking at transforms.


    Sorry @rbutler & @cseider,
    I made a mistake: the two pics are the same. Let me show you the correct ones.


    The data failed a normality test?

    It seems we’re being too “tool oriented” and the process needs shifting to the left.



    Yes, the data failed the normality test (P = 0.005).

    I want to know how to read the capability after transformation. What is it telling me exactly?


    Well, you should consider sharing the data–I have a suspicion. Did you run the Capability Sixpack tool in Minitab, or at minimum a histogram and a run/SPC chart for the individual readings?


    here is the data
    My specs = 13.9 – 15.9
    Target 14.9


    Capability after a transform gets to be REEEAAAAL interesting and you don’t want to go there – I’m with @cseider – it may have failed a normality test but I would forget that and focus on moving the process to the left. Again I would question the lack of data near the USL (sorry I called it the LCL last time – rented fingers you know) because it does look cherry picked. The other option is to show us the data and let some of us look at it.


    The data is in my previous reply,
    but could you explain the transformation that happened?


    I bet if you recorded to another decimal place, and the gage had the precision for it, the Anderson-Darling normality test would pass–limited measurement resolution is a common reason for failing this test.

    Some of the other normality tests may indicate that you can consider the distribution normal.

    Your graphs clearly show the transform that’s applied to both the specs and the raw data. A lambda of 0.5, for example, is a square root; the details can be found in the Minitab (MTB) help.


    I did what you sent me but the result is still the same.

    You said that the transformation was applied to the specs and the data, but I can’t find any relation between the two sets of numbers (I mean before and after transformation).

    Note that this is the first time I have dealt with transformations, so I can’t understand any of it.



    13.9**5 is about 518888, which is on the graph–just an illustration of lambda being 5 in your case.

    Anyways, process capability isn’t a simple matter. You should know whether you have a good gage (we’ve already identified that your precision may not be great) and do basic things like run/SPC charts to check for control, trends, etc.

    This has morphed into a multi-pass set of bullet points.


    So, what you are saying is that your measurements can only be taken to the first decimal place. So,
    Given that the limitation on measurement precision is real,
    Given that the data is really representative of the process
    Given that you are not looking at lot selection

    then what the normal probability plot is telling you is that you have a bi-modal distribution with data points 8, 18, and 30 representing the second distribution.

    With a sample size of 30 I could be reading too much into this but if the data points are in time order the sequence of 8, 18, and 30 is very close to 10 units of separation. Before thinking about transforms and Cpk I’d want to look at the circumstances associated with those three data points.

    If you can’t find anything I would still not want to consider calculating a process capability nor bother with transforms for the simple reason that, with the data you have, your normal probability plot is telling you that you have a bi-modal process which means it isn’t stable.



    Will the process become stable over time, or do I need to shift it to the left or modify the process itself?



    Why 13.9**5? Why 5 in particular? Is it a constant factor, or what does it depend on?


    The 13.9 is the LSL. The screen output shows Minitab determined Lambda = 5 and so you see the transformed LSL is as posted earlier 13.9^5 or 13.9**5 or 13.9 to the 5th power….all the same.
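As a rough illustration of the arithmetic above, the sketch below applies a plain power transform to the spec limits and target quoted earlier in the thread (13.9, 14.9, 15.9) with lambda = 5. It assumes the simple form x**lambda (natural log when lambda is 0), which is consistent with the 518888 value on the graph. The name `power_transform` is made up for this example and is not a Minitab function.

```python
import math

def power_transform(x, lam):
    """Simple Box-Cox style power transform: x**lam, or ln(x) when lam == 0.

    Assumes x > 0. Illustrative helper only, not Minitab's API.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    return math.log(x) if lam == 0 else x ** lam

# Spec limits and target from the thread; lambda = 5 as reported by Minitab.
LSL, TARGET, USL = 13.9, 14.9, 15.9
LAMBDA = 5

transformed = {name: power_transform(value, LAMBDA)
               for name, value in (("LSL", LSL), ("Target", TARGET), ("USL", USL))}

for name, value in transformed.items():
    print(f"{name}: {value:.0f}")

# lambda = 0.5 is the square root, as mentioned earlier in the thread.
sqrt_example = power_transform(16.0, 0.5)
```

Because the transform is monotonic for positive data and a positive lambda, the transformed LSL still sits below the transformed USL. The capability numbers on the transformed graph are computed on that transformed scale, which is why they look unrelated to the raw readings.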


    What is the optimal value of lambda to use?


    I’m not trying to be mean spirited but you have given me the impression that you are not paying attention to any of the responses you have received for this thread and the thread you started on 28 January with the title “Cpk and Normality”. In that thread you were looking at a 30 sample draw for 9300991 which exhibited most of the features you have shown us for the current analysis of 9326893.

    Your posts give the impression your only interest is calculating Cpk and that you are not at all concerned with your process and its behavior. It appears you have spent a lot of time focused on just trying to take whatever data you have and torture it with transforms in the hopes that somehow the results will pass a bunch of normality tests so that you can claim you have met the normality criteria for the standard calculation of Cpk.

    Back in the 28 January thread I posted the following: “…take your data, plot it on normal probability paper and identify the 0.135 and 99.865 percentile values (Z = ±3).
    The difference between these two values is the span for producing the middle 99.73% of the process output. This is the equivalent 6 sigma spread. Use this value for computing the equivalent process capability. As I mentioned in an earlier post Chapter 8 in Measuring Process Capability by Bothe titled “Measuring Capability for Non-Normal Variable Data” has the details. It is also my understanding that Minitab has this feature as part of its program so if you have access to that package you can let the machine run the analysis.”

    This method will give you what you want without having to resort to transforms. As I said above, your small sample suggests a bimodal issue with respect to the data for 9326893 which, if true, would suggest the process is not stable, and both plots suggest sample truncation. If the Cpk calculation is based on an unstable process and/or censored data then the final Cpk won’t mean a thing but at least it will satisfy the craving for a number.
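A minimal sketch of the percentile approach quoted above, assuming the 0.135, 50, and 99.865 percentile values have already been read off the probability plot. The equivalent spread is the quantity described in the quote; the one-sided index built around the median is a common convention that the thread does not spell out, so treat it as an assumption and check the Bothe chapter for the details. The function name and the mean and sigma used in the sanity check are made up.

```python
from statistics import NormalDist

def percentile_capability(p_low, median, p_high, lsl, usl):
    """Equivalent capability from the 0.135, 50 and 99.865 percentile values.

    p_low  : value at the 0.135th percentile
    p_high : value at the 99.865th percentile
    Returns (equivalent Pp, equivalent Ppk).
    """
    spread = p_high - p_low  # equivalent 6-sigma spread
    pp = (usl - lsl) / spread
    ppk = min((usl - median) / (p_high - median),
              (median - lsl) / (median - p_low))
    return pp, ppk

# Sanity check on a hypothetical normal process (mean and sigma are made up):
# the percentile spread should come out very close to 6 sigma.
SIGMA = 0.25
dist = NormalDist(mu=14.9, sigma=SIGMA)
p_low = dist.inv_cdf(0.00135)
p_high = dist.inv_cdf(0.99865)
median = dist.inv_cdf(0.5)

pp, ppk = percentile_capability(p_low, median, p_high, lsl=13.9, usl=15.9)
print(f"spread / sigma = {(p_high - p_low) / SIGMA:.3f}")
print(f"equivalent Pp = {pp:.2f}, equivalent Ppk = {ppk:.2f}")
```

For real, non-normal data the three percentile values would come from the probability plot or a fitted distribution rather than from `NormalDist`; the normal case is used here only to show that the method reduces to the usual 6 sigma spread.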

    If you are actually reading and attempting to take advantage of the responses to your questions then, my apologies. To that end I would only reiterate what I said earlier:

    The graphs for the data from 9326893 indicate you have many issues you need to resolve before thinking about Cpk. In addition to what has already been said I’d add the following:

    1. 30 samples – drawn how – what kind of a time interval (were these taken over days or just minutes?). How do you know these are representative and why did you limit yourself to 30 samples – time? cost? someone somewhere said you needed 30 samples to do anything?, etc.

    2. The plot for 9326893 says you are way off target and it also suggests instability – this is a very big issue and needs to be addressed before thinking about capability.

    3. Both 9326893 and 9300991 look like the data was cherry picked on the right hand side. Was it? How did you check this? If not – then the process must have some kind of physical bound on the high end. Do you know what it is? If not, why not?

    4. If it was cherry picked then you need to find out why and you need to get that rejected data in order to understand exactly where your process is and what it is doing. If there was cherry picking and if you don’t have that data before you start your analysis I guarantee you will go wrong with great assurance.

    5. If your measurement precision is really only to the 10ths place then you need to remember a basic rule of metrology – the science of measurement – the precision of your final estimate/calculation can be no better than the precision of the least precise measure used in generating the estimate/calculation. In other words, based on your data your final Cpk value will only have meaning to the 10ths place. The values you are showing on the report sheets for things like means, standard deviations, C.I.’s etc. with three digits after the decimal have no meaning past the first digit after the decimal and are what is referred to in metrology as empty precision.


    Dear @rbutler

    First, I am sorry if I am bothering you with my questions, but as I mentioned before I am new to Six Sigma, this is my first Six Sigma project, and I need to understand more.

    Second, let me show you what I am working on:
    1- I am working on flavor powder and I am measuring its chloride content.
    2- My goal is to achieve a Cpk of 1.33 on the chloride content results.
    3- 9326893 is a batch number; the batch quantity is 900 kg, packed in 36 carton boxes (25 kg each).
    4- I withdrew 30 samples from 30 cartons and measured the chloride content of each.
    5- I have worked for one year now to adjust the Cpk, going through the Six Sigma project to find the cause and effect of what is wrong, and I made some modifications to the process to adjust the Cpk.
    6- Before the modifications the process was normal but the Cpk was bad.
    7- By trying the modifications one by one I found that the Cpk is getting better but the process isn’t normal, which confused me.
    8- My measurement gage is well calibrated before each trial.
    9- When I studied for the Six Sigma Green Belt, they didn’t cover how to do capability analysis for non-normal data or how to interpret it.

    Finally, what do you mean by the expression “cherry picked”?

    Thank you and I appreciate your effort


    It is not an issue of bothering. The issue was the perception that all you cared about was Cpk regardless of process issues.

    I guess everyone has 20-20 hindsight – I wish you had posted your process details earlier because it would have saved a lot of time.

    First – “cherry picking” is the act of choosing samples in order to favor a particular outcome instead of randomly choosing samples and letting the measures be whatever they may be. So, if you didn’t cherry pick the data and if your pre-intervention data was perfectly normal then it would suggest the changes you instituted resulted in a truncation of output on the high side. Since the focus is chloride content of flavoring powder I would assume this is a good thing which means you achieved part of your objective. It also means that, given the LSL and USL the data you get from your process will most likely remain non-normal until you can identify changes in the process that will pull the process back to the target and drastically decrease the variation to permit the data to distribute itself in a somewhat normal fashion.

    Second – Some process details/questions

    1. You said the numbers refer to an individual batch. Therefore is it safe to say:
    a. a batch consists of the output of a single reactor vessel
    b. a batch consists of the output of a group of reactor vessels which have been fed raw material from the same material batches.
    c. a batch consists of product from several reactor vessels all of which have separate raw material feeds and each reactor vessel produces X number of lots of material within a batch run.

    d. or is it something else entirely?

    If the situation is (a) then your samples are not independent and the variation you are measuring is not an accurate assessment of your process – this, in itself, is an issue. Under these circumstances a plot from a single batch is not representative of your process.

    If the situation is (b) then you are most likely sampling a blend. There would still be some question of sample independence but there are bigger issues such as the possibility of one or more reactor vessels producing product that is out of spec and that product, when blended with the output from the other vessels could result in degraded product.

    If the situation is (c) then you would need to have some way of identifying the connection between reactor vessels and product lots before you could say much of anything about your process.

    and if the situation is (d) then all of the above is just an exercise in typing and you will need to provide more details before anyone can offer much in the way of advice.

    Third – your process modifications and Cpk

    You said you had normally distributed product before you made some changes and the Cpk was bad. Now you have non-normal data and a question about calculating Cpk. You can compute a Cpk for non-normal data using the method I described previously – no data transforms, no hat tricks, and no problem with translation back to actual measurements.

    However, before doing any of this I would recommend giving some more thought to what you have done.

    When you say you had normally distributed product before you made some changes do you mean:
    a. Every single plot of every single 30 sample test of each batch of prior product always passed a test for normality and gave a visual impression of near perfect normality.
    b. All/The Vast Majority of your 30 sample test plots of each batch gave a visual impression of approximate normality and passed normality tests most of the time.
    c. If not (a) or (b) then what do you mean by that statement?

    In addition to the question of what you mean by normality of prior measurements you said the Cpk was poor. Was this due to:
    a. Process not centered on target
    b. Process centered but spread much too wide
    c. Process not centered on target and spread much too wide
    d. something else
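The cases listed above map onto the two standard indices: Cp looks only at spread against the spec width, while Cpk also penalizes an off-center mean. A minimal sketch, using the spec limits from this thread and made-up means and standard deviations (none of these numbers come from the poster's data):

```python
def cp(lsl, usl, sigma):
    """Potential capability: spec width over six standard deviations."""
    return (usl - lsl) / (6 * sigma)

def cpk(lsl, usl, mean, sigma):
    """Actual capability: distance from the mean to the nearer limit, in 3-sigma units."""
    return min(usl - mean, mean - lsl) / (3 * sigma)

LSL, USL = 13.9, 15.9  # spec limits from the thread

# Hypothetical (mean, sigma) pairs for illustration only.
cases = {
    "a. off target, narrow spread": (15.5, 0.15),
    "b. centered, wide spread": (14.9, 0.40),
    "c. off target and wide": (15.5, 0.40),
}

results = {label: (cp(LSL, USL, sigma), cpk(LSL, USL, mean, sigma))
           for label, (mean, sigma) in cases.items()}

for label, (cp_value, cpk_value) in results.items():
    print(f"{label}: Cp = {cp_value:.2f}, Cpk = {cpk_value:.2f}")
```

A good Cp with a poor Cpk points at centering; when both are low and roughly equal, the spread is the problem; when Cpk is well below an already low Cp, both need work.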

    Finally, as I said before, you will not be able to claim a Cpk of 1.33 or any other Cpk of the form X.XX because your data will not support two decimals. You can, of course, ignore this and just let the machine grind out empty precision but you should be aware that the best you yourself can do with the data you have is a Cpk of 1.3 or 1.4 or X.X and no more.


    Hi @rbutler ,

    1- The batch is made in one blender (a batch consists of the output of a single reactor vessel).
    2- When I said that the process was normal before the modifications, I meant that I analyzed 9 batches (30 samples each) to examine the performance of my process before I started the Six Sigma project.
    3- All the batches I examined were normally distributed, but the process was not centered on the target: most of the results were on the right side, in the upper part of the range (from 15.1 to 15.9). Note that my specs are 13.9 to 15.9 and my target is 14.9.


    Ok, so all you really had were 9 independent measures of your process and all you actually have for lots 9326893 and 9300991 is a single independent measure each.

    How were these samples taken?

    a. Did you take samples over time as the product moved out of the reactor vessel (in which case you would at least have a measure of top-to-bottom differences of product within a single batch)?
    b. When the product moves out of the vessel is it collected in a single bin or is it parsed into smaller lots? If it is parsed as the material comes out of the vessel and if you know the lot order number and if you take a sample from each lot then you should have measurements that acceptably match a check of top-to-bottom product uniformity.

    The reason for asking is because I’ve worked with very large reactor vessels (10,000 gallons) so I understand the issue concerning independent measures and the compromises that need to be made in order to get on with the work at hand.

    In each case I worked with the engineers on the line to figure out a way to get time sequence sample measures. With data of this type you can look at product uniformity within a batch and at least be guaranteed that you are including within batch variability in your analysis.

    If you haven’t done this check with your engineers to see if it is possible to get samples in this fashion. If you can then do so. If you can’t then you will have to make do with what you have.

    In either case, before you run any Cpk calculations you will need to take the results of multiple lots, combine the measurements, and compute a Cpk based on all of the data.
    …and before you do this you will want to make sure you are including as much of the process variability as possible. To that end you will want to make sure that the batches you are sampling were made with different lots of raw materials and were made over some period of time (at least a week or two) so that other sources of unknown process variation will have a chance to be represented in the measurements you are taking.
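A small sketch of why the pooling matters, using placeholder readings that are NOT real data. Within one lot only the within-batch variation is visible; combining lots brings the batch-to-batch variation into the estimate. Strictly speaking, an index based on the overall standard deviation is what Minitab labels Ppk; the thread uses Cpk loosely, so the sketch keeps that name. Results are printed to one decimal, in line with the measurement precision discussed earlier.

```python
from statistics import mean, stdev

LSL, USL = 13.9, 15.9  # spec limits from the thread

# Placeholder readings, NOT real data: three lots made at different times.
lots = {
    "lot_A": [15.1, 15.2, 15.2, 15.3, 15.4],
    "lot_B": [14.8, 14.9, 15.0, 15.0, 15.1],
    "lot_C": [15.3, 15.4, 15.5, 15.5, 15.6],
}

def cpk(values, lsl, usl):
    """Cpk from the overall mean and sample standard deviation of `values`."""
    m, s = mean(values), stdev(values)
    return min(usl - m, m - lsl) / (3 * s)

# Each lot on its own only reflects within-batch variation.
per_lot = {name: cpk(values, LSL, USL) for name, values in lots.items()}

# Pooling all lots adds the batch-to-batch variation to the estimate.
pooled = [x for values in lots.values() for x in values]
pooled_cpk = cpk(pooled, LSL, USL)

for name, value in per_lot.items():
    print(f"{name}: Cpk = {value:.1f}")
print(f"pooled: Cpk = {pooled_cpk:.1f}")
```

With these placeholder values every single lot looks better than the pooled result, which is the trap of judging a process from one batch.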

    In the interim – I’d recommend going back to your old data (before adjustments), pool all of that data and see what it says about your process before you did anything. Then I would take the results from lots 9326893 and 9300991 (and any other lot you sampled after the adjustments), pool them and compare the two groups.

    You should know the time order of the sampling of the lots so an additional check would be to use time sequence box plots to look at your process over time. Each one of the boxes would be all of the data from a sampled lot. You may see trending in the data that helps you better understand your process and perhaps points to a possible solution.
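The numbers behind a time sequence box plot can be sketched without any plotting package: one five-number summary per lot, listed in production order. The readings below are placeholders, NOT real data, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from what Minitab draws.

```python
from statistics import median, quantiles

# Placeholder readings, NOT real data; lots listed in the order they were made.
lots_in_time_order = [
    ("lot_1", [15.1, 15.2, 15.2, 15.3, 15.4, 15.5]),
    ("lot_2", [15.0, 15.1, 15.1, 15.2, 15.3, 15.3]),
    ("lot_3", [14.8, 14.9, 15.0, 15.0, 15.1, 15.2]),
]

def box_summary(values):
    """Five-number summary that a box plot would draw for one lot."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    return {"min": min(values), "q1": q1, "median": median(values),
            "q3": q3, "max": max(values)}

summaries = [(name, box_summary(values)) for name, values in lots_in_time_order]

for name, s in summaries:
    print(f"{name}: min={s['min']:.1f} q1={s['q1']:.2f} "
          f"median={s['median']:.2f} q3={s['q3']:.2f} max={s['max']:.1f}")

# A steady drift in the medians from lot to lot would show up as a trend.
medians = [s["median"] for _, s in summaries]
```

Reading the medians and box widths down the list gives the same information as the plot: drifting medians suggest a trend over time, and changing box widths suggest within-batch variation that is not constant.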


    looks like a measure of nonhomogeneity? fyi, it’s screaming an MSA must be done if you’re trying to conclude differences among samples.
