Effect of Multiple Categorical Variables on Continuous Dependent Variable
This topic contains 5 replies, has 4 voices, and was last updated by Bob Behrens 6 months ago.
- June 13, 2018 at 9:09 pm #56020
Hoping someone could give me some pointers on what methods to use in Minitab to understand the effects of multiple categorical variables, e.g.:
OperatorID, MachineType, Shift (Morning/Afternoon/Night) etc.
on a continuous variable we’re trying to reduce and control (Process Cycle Time) in a production process.
The aim is to reduce our cycle time by identifying and controlling the key sources of variability.
- June 14, 2018 at 12:22 am #202670
Well, I’d use boxplots/scatterplots first to see any differences without being as skewed by outliers. I’d do this for each category against cycle time to see which ones show a noticeable difference in variation. How many data points do you have for each categorical variable? I’m not very experienced, so I don’t want to lead you down the garden path, but that’s what I would look at before anything else.
- June 14, 2018 at 5:19 am #202671
As @Daniel.S said, a good first step is data plotting, and the methods he suggested are appropriate. The big issues are to check for extreme points that might distort any later analysis and to check for collinearity (confounding) among the categorical variables.
Confounding of variables: for all categorical variables, a simple first look would be a series of matrices with one categorical variable against another. What you are looking for is observations in every cell of each matrix. An empty cell means that combination of levels never occurred, so the effects of those two variables cannot be fully separated.
After proper coding of the categorical variables via dummy/indicator variables (see below), you would also want to run VIF and condition indices to check for multicollinearity. Most statistics packages cannot compute condition indices, but almost all of the good ones (Minitab, Statistica, etc.) have VIF capability.
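As a rough illustration of that first look, the sketch below counts observations for every pair of levels of two categorical variables and lists the combinations that never occur. It is plain Python rather than Minitab, and the records, level names, and the helper names `crosstab` and `empty_cells` are made up for the example.

```python
from collections import Counter
from itertools import product

# Hypothetical records: (operator, machine, shift). Not data from this thread.
records = [
    ("Op1", "M1", "Morning"),
    ("Op1", "M2", "Morning"),
    ("Op2", "M1", "Afternoon"),
    ("Op2", "M2", "Afternoon"),
    ("Op3", "M1", "Night"),
    ("Op3", "M2", "Night"),
    ("Op1", "M1", "Afternoon"),
]


def crosstab(rows, i, j):
    """Count observations for each pair of levels of columns i and j."""
    counts = Counter((r[i], r[j]) for r in rows)
    levels_i = sorted({r[i] for r in rows})
    levels_j = sorted({r[j] for r in rows})
    return levels_i, levels_j, counts


def empty_cells(rows, i, j):
    """Return the level combinations of columns i and j with no observations."""
    levels_i, levels_j, counts = crosstab(rows, i, j)
    return [(a, b) for a, b in product(levels_i, levels_j) if counts[(a, b)] == 0]


# Operator vs. machine: every combination was observed.
print(empty_cells(records, 0, 1))
# Operator vs. shift: several combinations are missing, a sign of confounding.
print(empty_cells(records, 0, 2))
```

In this made-up data each operator mostly works a single shift, so an apparent shift effect could just as easily be an operator effect.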
As for prediction the easiest thing to do for categorical variables would be a multiple regression using the categorical variables as predictors. To do this you will need to build a series of dummy variables in order to use the categorical variables in a regression analysis (note: if your categorical variable has only two levels then a simple coding of either 0,1 or -1,1 for the two levels is sufficient).
The basic coding structure is as follows:
Let’s take shift as the variable of interest.
For shift A we will define dv1 = 0 and dv2 = 0.
For shift B we will define dv1 = 1 and dv2 = 0.
For shift C we will define dv1 = 0 and dv2 = 1.
Therefore the regression equation form for cycle time would be:
Cycle time = b0 + b1*dv1 + b2*dv2.
Effect of Shift A: Cycle time = b0 + b1*(0) + b2*(0) = b0
Effect of Shift B: Cycle time = b0 + b1*(1) + b2*(0) = b0 + b1
Effect of Shift C: Cycle time = b0 + b1*(0) + b2*(1) = b0 + b2
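To make the coding concrete, here is a minimal Python sketch with made-up cycle times. It relies on one fact: when shift is the only predictor, the least-squares fit reproduces the group means, so b0 is the mean of shift A and b1, b2 are the differences of shifts B and C from A. With several predictors this shortcut no longer applies and you would use a regression routine such as the one in Minitab. The function names and data are hypothetical.

```python
from statistics import mean


def dummy_code(shift):
    """Return (dv1, dv2) using the coding above: A=(0,0), B=(1,0), C=(0,1)."""
    return (1 if shift == "B" else 0, 1 if shift == "C" else 0)


# Hypothetical (shift, cycle time) observations.
data = [("A", 10.0), ("A", 12.0), ("B", 14.0), ("B", 16.0), ("C", 9.0), ("C", 11.0)]


def fit_single_factor(rows):
    """Least-squares coefficients when shift is the only predictor.

    In this special case the fitted values equal the group means, so the
    coefficients can be read directly off those means.
    """
    by_shift = {}
    for shift, y in rows:
        by_shift.setdefault(shift, []).append(y)
    b0 = mean(by_shift["A"])        # reference level
    b1 = mean(by_shift["B"]) - b0   # shift B relative to shift A
    b2 = mean(by_shift["C"]) - b0   # shift C relative to shift A
    return b0, b1, b2


def predict(shift, coefs):
    """Evaluate Cycle time = b0 + b1*dv1 + b2*dv2 for a given shift."""
    b0, b1, b2 = coefs
    dv1, dv2 = dummy_code(shift)
    return b0 + b1 * dv1 + b2 * dv2


coefs = fit_single_factor(data)
for s in ("A", "B", "C"):
    print(s, dummy_code(s), predict(s, coefs))
```

The choice of shift A as the reference level is arbitrary; picking a different reference changes the coefficients but not the predicted cycle time for each shift.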
You should do some reading on this subject before trying to run your analysis. A good reference on this subject is Chatterjee and Price’s book Regression Analysis by Example. You should be able to get this book through inter-library loan. The chapter of interest is “Qualitative Variable Analysis”.
- June 14, 2018 at 9:29 am #202673
@willtfj I normally do not put up a response to anything that Robert Butler has commented on, basically because he is so rock solid on analysis that I really am not going to add much value.
I am making an exception here only because I have a tool I really like to use at the beginning of a project that can create a very nice level of understanding very quickly. It takes some planning on the data collection side for the analysis to come out properly. If you create a Multi-Vari chart, you get a graphic that lets you see whether you have within-piece, piece-to-piece, or across-time variation. That is a huge piece of information when it comes to narrowing down potential causes.
Just a warning. Very hard-core statisticians will from time to time give some flak about these charts not being statistically sound. Regardless, they are a very good visual tool, particularly for aligning a team or explaining your progress to leadership.
Just my opinion.
- June 18, 2018 at 8:56 am #202685
Another approach: take your top ten and bottom ten cycle times and do a simple side-by-side comparison of the categorical variables. Suppose one process variable is machine A vs. machine B. If your ten lowest cycle times align mostly (say, 8 or more) with machine A and your ten highest cycle times align mostly with machine B, then you have a clue (not a conclusion) that machine type is a possible driver of cycle time. Of course, you would investigate further.
- June 18, 2018 at 1:40 pm #202688
I like multi-vari as well initially. It only gives you a comparison of means, but unless your data is skewed with a lot of outliers, it will give you a good starting point. Also, keep in mind that your response, cycle time, will most likely not be normally distributed. Once you complete the multi-vari, look at the primary offenders and compare them with statistical tools.
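For anyone who wants to try the top-ten / bottom-ten comparison suggested earlier in the thread outside Minitab, a minimal Python sketch follows. The cycle times and the function name `extremes_by_level` are invented for illustration, and the result is a clue for further investigation, not a statistical test.

```python
from collections import Counter

# Hypothetical (machine, cycle time) observations, 12 runs per machine.
a_times = [20, 21, 22, 23, 24, 25, 26, 27, 31, 32, 40, 41]
b_times = [28, 29, 30, 33, 34, 35, 36, 37, 38, 39, 42, 43]
runs = [("A", t) for t in a_times] + [("B", t) for t in b_times]


def extremes_by_level(rows, n=10):
    """Count how often each level appears among the n lowest and n highest times."""
    ordered = sorted(rows, key=lambda r: r[1])
    low = Counter(level for level, _ in ordered[:n])
    high = Counter(level for level, _ in ordered[-n:])
    return low, high


low, high = extremes_by_level(runs)
print("ten lowest: ", dict(low))
print("ten highest:", dict(high))
```

In this made-up data, 8 of the ten lowest times come from machine A and 8 of the ten highest from machine B, which is the kind of pattern that would justify a closer look at machine type.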