
Working with large databases – MSA Sampling


  • #51214

    newbie
    Participant

    I am being asked to regress a large data set (>1M rows) and would like to know whether I should use the entire population or sample it… and how should I approach the MSA component here?
     
    Thanks!

    #177143

    BTDT
    Participant

    newbie: We are finishing a project where we wanted to look at all work orders going back four years. We wanted to make sure none of the stakeholders would reject the baseline or metric improvement based on holidays, high-low season, weekends, long weekends, changes of facilities, management initiatives, employee turnover, even economic changes. Whenever we mention sampling, all the objections tend to come out and ears close.

    As far as the MSA went, we made sure that “cycle time” really reflected the time available for work. Most of our MSA-related work involved a very clear operational definition and correcting the data for unscheduled downtime, meals, breaks, weekends, training, different shifts at different locations, transfer of work between facilities, handoffs between and within shifts, coding errors, etc. This was part of our data collection plan.

    In the end we had 5.4M records to start sorting through. Excel 2007 was used to sort and code the data (about 1,000,000 records are allowed per sheet). We did a LOT of VLOOKUPs. This came down to about 650,000 unique data values. It gave us great power to say we had looked at EVERY single interaction with EVERY single work order at ALL locations and for ALL equipment types for four years. This really helped with dealing with any resistance at the Analyze phase.

    Minitab could handle the binary logistic regression with all our data. It was slow, but still chugged away. We felt it would take us much longer to explain a sampling plan than to patiently wait for Excel or Minitab to finish.

    Best of luck,
    Alastair
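    For anyone repeating this kind of analysis outside Excel and Minitab, here is a minimal sketch of the same load-and-regress step in Python with pandas and statsmodels. The file and column names below are hypothetical, for illustration only.

        # Minimal sketch only: load a large work-order extract and fit a binary
        # logistic regression on the full population. File and column names are
        # hypothetical; adjust to your own data.
        import pandas as pd
        import statsmodels.api as sm

        # Read only the columns needed, to keep memory manageable on millions of rows
        cols = ["cycle_time_hrs", "location", "equipment_type", "met_target"]
        df = pd.read_csv("work_orders.csv", usecols=cols)

        # Dummy-code the categorical predictors
        X = pd.get_dummies(df[["cycle_time_hrs", "location", "equipment_type"]],
                           drop_first=True, dtype=float)
        X = sm.add_constant(X)
        y = df["met_target"]          # 1 = work order met its target, 0 = missed

        model = sm.Logit(y, X).fit()  # binary logistic regression, no sampling
        print(model.summary())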

    #177145

    Gary Cone
    Participant

    Alastair,
    The next time you have to deal with >1 million records, shoot me a note. There are some very cool ways of using a text file with all the data, reading it into Excel, and keeping detail records of what you are looking for.

    #177147

    BTDT
    Participant

    Gary: Will do. I was doing so much “text-to-data” with combinations of RIGHT(LEFT(...)) to parse the fields, plus VLOOKUPs, that I had to confirm the logic of the field in one entire column, then do a copy-paste-values to prevent that column from ever being recalculated.

    I now know we should have been doing all of it in an Access database, but the project kind of grew.

    Cheers, Alastair

    P.S. 356 MB in the end.
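    For what it is worth, the same parse-and-lookup step is only a few lines in Python with pandas, which avoids the recalculation problem entirely. The file names, column names, and field positions below are hypothetical.

        # Hypothetical sketch of the RIGHT(LEFT(...)) parsing plus VLOOKUP step,
        # done in pandas instead of Excel formulas.
        import pandas as pd

        orders = pd.read_csv("work_orders.csv", dtype=str)

        # Equivalent of RIGHT(LEFT(order_code, 7), 4): characters 4 through 7
        orders["facility_id"] = orders["order_code"].str[3:7]

        # Equivalent of a VLOOKUP against a facility reference table
        facilities = pd.read_csv("facility_lookup.csv", dtype=str)
        orders = orders.merge(facilities, on="facility_id", how="left")

        # No copy-paste-values needed: the merged column is data, not a live formula
        orders.to_csv("work_orders_coded.csv", index=False)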

    #177148

    newbie
    Participant

    Thanks Alastair! That answers that…..

    #177149

    newbie
    Participant

    Alastair,
    One more for you – I will be working with happenstance data, so my understanding is that our findings can speak to correlation and prediction between the response and the predictors, but will not allow us to speak in terms of causation and influence. If that is a correct summary, I can see the idea wreaking havoc with the bosses: “You can’t say that X causes Y… what good is that to us then?” etc.
    How would you frame this explanation at an executive level? Thanks!!

    #177150

    Christie
    Participant

    Hi newbie,
    I do this sort of thing at work occasionally. There are techniques for this that fall under data mining. Normally you would take about 60% of your data and use decision trees, or some other technique, to build a model, and then use the other 40% of your data to validate the model. I used software called XL Miner because it will split the data with the click of a button.
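    If you do not have XL Miner, the same 60/40 split-and-validate idea is straightforward in Python with scikit-learn. This is only a sketch; the column names are hypothetical.

        # Sketch of a 60/40 train/validate split with a decision tree,
        # using scikit-learn instead of XL Miner. Column names are hypothetical.
        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.metrics import accuracy_score

        df = pd.read_csv("work_orders.csv")
        X = df[["cycle_time_hrs", "queue_length", "handoffs"]]  # numeric predictors
        y = df["met_target"]                                    # binary response

        # 60% of the rows build the model, 40% are held back to validate it
        X_train, X_valid, y_train, y_valid = train_test_split(
            X, y, train_size=0.6, random_state=1)

        tree = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
        print("Validation accuracy:", accuracy_score(y_valid, tree.predict(X_valid)))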

    #177152

    BTDT
    Participant

    newbie: When you have happenstance data, the scope of influence is restricted to the data in the context in which it was collected. You may summarize this by first stating that the data collected was representative of some kind of subset, for example, “for the NW sales office during the first quarter (Q1).” This means any observations apply only to those circumstances.

    What you can say, with respect to the lack of causation, is something like, “an increase in sales is associated with a decrease in humidity.” This does not imply causation in either direction and allows for other variables influencing both.

    I find that a thorough data collection plan with a LARGE number of other potential factors allows you to rule out many noise factors and can focus the team on what are likely to be the most logical cause-and-effect relationships. In other words, don’t just test factors because you can; test the ones the team thinks MIGHT logically be important. This means using the team’s subject matter experts and benchmarking previous studies.

    The result should be a list of largest to smallest factors. This result has probably not been shown to the team before, and it beats the pants off having everyone vote for the factors they think have the largest influence.

    Cheers, Alastair

    #177153

    BTDT
    Participant

    That should read “scope of inference,” not “scope of influence.” Cheers, Alastair

    #177174

    newbie
    Participant

    Alastair,
    How do you prioritize the list of potentially causal factors using SMEs, etc., without some form of opinion gathering (i.e., voting)?
    “The result should be a list of largest to smallest factors”
    Were you referring to the factors being listed, after regression, by magnitude and direction of effect? Thanks!

    #177178

    BTDT
    Participant

    newbie: Yes, the factors can be listed in order after you have completed the regression. With such a large database, it is possible for a factor to have a very small but still statistically significant effect, so the size of the effect is much more important than whether a factor is significant or not.

    If I were in the audience, I would want to know which “levers” are associated with the largest effect. We will still run a pilot project to confirm causation. The regression, even though it is on happenstance data, gives me better support for Improve strategies than opinion or voting.

    I have shown the results in the form of a Pareto chart with error bars on each factor. Excel will let you adapt some sort of a high-low-close chart for this.

    Cheers, Alastair
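    For anyone without the Excel high-low-close trick, here is a sketch of that Pareto-with-error-bars display in Python with matplotlib. It assumes a fitted statsmodels logistic regression result named model (as in the earlier sketch); everything else is illustrative.

        # Sketch: Pareto-style chart of regression effects with 95% CI error bars,
        # drawn with matplotlib rather than an adapted Excel high-low-close chart.
        # Assumes `model` is a fitted statsmodels Logit result.
        import matplotlib.pyplot as plt

        effects = model.params.drop("const")   # effect estimate for each factor
        ci = model.conf_int().drop("const")    # 95% confidence interval limits

        # Order factors from largest to smallest absolute effect
        order = effects.abs().sort_values(ascending=False).index
        effects, ci = effects.loc[order], ci.loc[order]

        # Error bar lengths: distance from each estimate to its CI limits
        err = [effects - ci[0], ci[1] - effects]
        plt.bar(range(len(effects)), effects, yerr=err, capsize=4)
        plt.xticks(range(len(effects)), effects.index, rotation=45, ha="right")
        plt.ylabel("Effect on log-odds")
        plt.tight_layout()
        plt.show()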

    #177179

    newbie
    Participant

    Alastair,
    Thanks so much for all the help!


The forum ‘General’ is closed to new topics and replies.