Scatter Plotting
 This topic has 6 replies, 5 voices, and was last updated 14 years, 4 months ago by annon.


September 14, 2007 at 12:48 pm #48113
Hi
I am wondering whether it makes any sense to create scatter plots of my Y against the Xs from historical data unless I can actually create situations where I measure Y against each X while keeping the other Xs constant. I am concerned because if my Y is not changing much despite changes in one of the Xs, I may be misled by the scatter plot. In reality, another X may be having a counterbalancing effect and preventing Y from changing. In the absence of that X, Y may well change wholeheartedly with the first X. So the bigger question that comes up is: is a scatter plot completely useless when it comes to understanding relationships from historical data?
September 14, 2007 at 1:42 pm #161119
C Suresh
Hi Mark,
Scatter plot can be effective with one X (independent variable) against the Y(dependent variable).
But when you have many Xs, fitting a Regression model will be more helpful.
Participants in this forum are welcome to correct me if I am wrong.
Thanks
C Suresh
September 14, 2007 at 3:38 pm #161128
The SP is used to determine the degree of association between two continuous variables. Using it with happenstance, or historical, data is fine. It is often helpful to run potentially causal inputs from an earlier tool (e.g., a C&E diagram) through a quick SP analysis in an initial screening effort. It is not a bad idea to use both a graphical and an analytical approach (e.g., the Pearson correlation coefficient alongside the SP) when testing for correlation. This will allow you to better quantify the degree of correlation while allowing you to see such things as outliers, nonlinear patterns, etc.
The earlier advice to run a multiple linear regression analysis with your multiple predictors (x) and single continuous response variable (y) would probably be worthwhile and yield greater insight into the predictability of your factors. Good luck.
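To make the original masking concern concrete, here is a minimal sketch in Python on simulated data. Everything in it is an assumption made up for illustration (the variable names, the coefficients, the noise levels); it is not data from any real process. A second input x2 is built to counterbalance x1, so the simple correlation between x1 and y is near zero, while a regression on both inputs still recovers the effect of x1.

```python
import random

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Simulated (hypothetical) process: x2 moves opposite to x1, and both
# inputs push y up with the same strength, so their effects cancel.
rng = random.Random(0)
n = 200
x1 = [rng.uniform(0, 10) for _ in range(n)]
x2 = [-a + rng.gauss(0, 1) for a in x1]
y = [2 * a + 2 * b + rng.gauss(0, 0.5) for a, b in zip(x1, x2)]

r_y_x1 = pearson_r(x1, y)
b0, b1, b2 = fit_two_predictors(x1, x2, y)
print(f"correlation of y with x1 alone: {r_y_x1:.2f}")
print(f"regression coefficients: b1={b1:.2f}, b2={b2:.2f}")
```

The scatter plot of y against x1 would look flat here, which is exactly the situation described in the first post. Note that the regression only works because x2 is not a perfect mirror of x1; if the two inputs were perfectly confounded, no analysis of the historical data could separate them.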
September 14, 2007 at 3:52 pm #161129
Dennis Craggs
Try Minitab’s correlation analysis. Some of the X’s may not be independent of each other, so they may show correlation too. Then try the “Best Subsets” regression. Here Y is regressed against different subsets of the X’s and the best-fitting subsets are reported. The results are provided in a table that shows how the fit changes as X’s are added to or removed from the analysis.
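For readers without Minitab, the idea behind a best-subsets search can be sketched in plain Python. This is an illustration of the general technique only, not Minitab's implementation: it fits every subset of predictors by ordinary least squares and keeps the highest R-squared for each subset size. The data, the names x1 to x3, and the helper functions are all hypothetical.

```python
import itertools
import random

def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R-squared of a least-squares fit of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def best_subsets(predictors, y):
    """Return (size, names, r2, adjusted r2) for the best subset of each size."""
    n = len(y)
    table = []
    for size in range(1, len(predictors) + 1):
        best = None
        for names in itertools.combinations(predictors, size):
            r2 = r_squared([predictors[name] for name in names], y)
            adj = 1 - (1 - r2) * (n - 1) / (n - size - 1)
            if best is None or r2 > best[2]:
                best = (size, names, r2, adj)
        table.append(best)
    return table

# Simulated data: y depends on x1 and x2; x3 is an irrelevant input.
rng = random.Random(1)
n = 100
predictors = {name: [rng.uniform(0, 10) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [3 * a - 2 * b + rng.gauss(0, 1) for a, b in zip(predictors["x1"], predictors["x2"])]

table = best_subsets(predictors, y)
for size, names, r2, adj in table:
    print(size, names, f"R2={r2:.3f}", f"adjR2={adj:.3f}")
```

R-squared can only go up as predictors are added, so the adjusted value is the more useful column when deciding where to stop. With historical data the caveat from the rest of this thread still applies: a subset that fits well is a candidate for further investigation, not proof of cause.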
September 15, 2007 at 7:39 am #161160
So are we saying that correlation analysis can be used only as an adjunct to regression analysis? Can’t we just use it to screen and select potential predictors? Can’t we make any decisions based on correlation analysis alone?
September 15, 2007 at 2:04 pm #161162
Robert Butler
Annon, Suresh, and Craggs have touched on some of the issues associated with the problems of investigating happenstance/prior/production data.
What you are attempting to do is EDA – Exploratory Data Analysis.
1. EDA is never simple.
2. EDA requires the use of many graphical and nongraphical tools.
3. EDA requires that you use these tools in conjunction with each other.
4. When doing EDA there is no such thing as a single preferred tool or subset of statistical tools. In fact to do successful EDA one should live and breathe Kaplan’s observation concerning scientific investigation – “When attacking a problem the good scientist will utilize anything that suggests itself as a weapon.”
5. EDA isn’t something you are going to knock off in an hour or two.
6. To get a sense of the breadth and depth of the tools/methods that should be employed you might want to read Tukey’s book by the same name.
My usual approach to happenstance data is to first plot the data in as many ways as I can. The current graphics packages offer a lot in this regard. I use scatterplots (2 and 3 dimension) and I make extensive use of the ability to do subset coding which impacts the color and shape of the plotted points. In this way I can get a sense of how other X’s acting in conjunction with those on the axis are impacting the Y. I also do scatterplots of the X’s against one another. If some of the potential X’s are categorical or are fixed at certain specified levels I will also look at boxplots.
Even with all of the confounding that is sure to be present these plots are of value for the following reasons:
1. They will give me some sense of how a Y may actually trend with respect to a given X.
2. They will highlight potentially influential data points.
3. If I use the subset coding it is easy to spot clusters of data points that may either prove to be influential or may not really be appropriate and thus may be safely excluded from consideration.
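The value of subset coding can be shown numerically as well as graphically. The sketch below is a stand-in for a colour-coded scatterplot, using simulated data and a hypothetical two-level categorical input (group "A" and group "B"). Pooled together the points suggest that y falls as x1 rises; split by group, y rises with x1 in both.

```python
import random

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

# Simulated rows of (group, x1, y). Within each group y rises with x1,
# but group "B" runs at higher x1 and a much lower baseline y.
rng = random.Random(2)
rows = []
for _ in range(100):
    x = rng.uniform(0, 10)
    rows.append(("A", x, x + rng.gauss(0, 1)))
for _ in range(100):
    x = rng.uniform(10, 20)
    rows.append(("B", x, x - 20 + rng.gauss(0, 1)))

# Correlation ignoring the grouping, as an uncoded scatterplot would show it.
overall = pearson_r([r[1] for r in rows], [r[2] for r in rows])

# Correlation within each subset, as a coded scatterplot would reveal.
within = {}
for group in ("A", "B"):
    subset = [r for r in rows if r[0] == group]
    within[group] = pearson_r([r[1] for r in subset], [r[2] for r in subset])

print(f"overall r = {overall:.2f}")
for group, r in within.items():
    print(f"group {group}: r = {r:.2f}")
```

On a plot the same thing appears as two separately coloured clouds, each sloping upward, which is much easier to spot by eye than in a table of coefficients.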
Once I have this I will run diagnostics on the X matrix. To do this correctly you will need the ability to run VIFs (variance inflation factors) and eigenvalues. The simple correlation matrix won’t suffice, but if it is all you have then use it, with caution.
The diagnostics on the X matrix will tell me what X’s are reasonably independent of one another and, more importantly, what X’s I couldn’t look at for purposes of model building with this data. I will take this group of X’s and go back to the scatterplots and plot the Y against the group of X’s that can be investigated. These plots will give me a sense of the kinds of terms I should be including (linear, squared, etc.) and they will also highlight influential data point problems.
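As a rough illustration of the VIF part of these diagnostics, here is a minimal sketch in plain Python. It relies on the standard identity that the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix. The data and the names x1 to x3 are simulated assumptions, and the eigenvalue analysis is left out; a linear algebra library would be the practical choice for that.

```python
import random

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]

def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [rv - f * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(columns):
    """Variance inflation factor for each predictor column."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Simulated X matrix: x2 nearly duplicates x1, x3 is unrelated to both.
rng = random.Random(3)
n = 200
x1 = [rng.uniform(0, 10) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.5) for a in x1]
x3 = [rng.uniform(0, 10) for _ in range(n)]

v = vifs([x1, x2, x3])
for name, value in zip(("x1", "x2", "x3"), v):
    print(f"VIF({name}) = {value:.1f}")
```

A VIF near 1 means the predictor is close to independent of the others. Commonly quoted rules of thumb treat values above roughly 5 to 10 as a warning sign, though that cutoff is a convention rather than anything stated in this thread. Here x1 and x2 flag each other, and the choice of which one to drop is where the scatterplots and process knowledge come in.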
You may wonder why, if I can do all of this with the X matrix, I would bother with the scatterplots. The answer is this: the correlation structure of the X matrix is going to be complex. If you have the ability to run VIFs and eigenvalues you will be able to quantify the relationships between the Xs. Often, you will have multiple choices with respect to which variable to include or exclude in order to give you a final X matrix that is reasonably orthogonal. If you have the graphs in front of you, and if you have taken the time to find out whether the patterns in the plots are physically meaningful, you can make your choices for the X matrix based on something more than the blind application of VIFs and eigenvalues. You will also be able to explain to anyone who may ask why you chose to attribute the trending to a particular X and not to one of the other Xs that are partially, but significantly, confounded with it. This information will prove to be very useful when it comes time to consider further investigation and the utilization of tools such as DOE.
September 15, 2007 at 5:43 pm #161167
Use whatever methods are useful to your situation. But CA can be, and certainly has been, used to explore potentially causal relationships between process variables (2 inputs, 2 outputs, input v. output), predominantly in an effort to explore and refine the number of predictors you intend to carry forward.
Making a decision on what to take forward into further analysis based on CA alone would, in my humble opinion, be an acceptable practice; making improvement efforts based solely on this technique would be worrisome to me.
I think there really is only one goal: to understand the process at hand. What does what to what. From here, you can make it do whatever you want. CA can be a small and useful part of that, but will not provide you with the information you need to drive the process.
Just an opinion.