# Problem with Data Collection

Six Sigma – iSixSigma Forums General Forums Implementation Problem with Data Collection

• #55075

Amit Kumar Ojha
Participant

Hi All,

I have a situation where I need some suggestions.

There is a process in which the population size is more than 1000; however, since data collection is extremely tedious and costly, we have no option but to go ahead with a sample size of 15-20.
Although I am using the appropriate statistical tests (since the sample size is less than 30, using the t-test for comparing means, etc.), it is probable that top management may question the validity of inferences drawn from such a small sample size.

Can anyone please suggest how to justify the results obtained, or suggest some other alternative?

#198549

Robert Butler
Participant

When you say you have a population size of more than 1000 and you can only justify a sample of 15-20 because of time/money constraints, you give the impression that the samples already exist and that it is a matter of just choosing 15 or 20 for analysis. If this is the case then you need to use a random number generator to randomly assign the numbers 1-1000 to your data set, then sort on the randomized number and take those numbering 1-15 or 1-20.
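The random-draw step above can be sketched in a few lines. This is an illustrative Python sketch; `population_ids` is a hypothetical stand-in for however the 1000 records are identified:

```python
import random

def draw_random_sample(population_ids, n, seed=None):
    """Randomly select n record IDs from the population.

    Equivalent to assigning each record a random number,
    sorting on that number, and taking the first n records.
    """
    rng = random.Random(seed)
    return rng.sample(population_ids, n)

# Example: pick 15 of 1000 record IDs.
sample = draw_random_sample(list(range(1, 1001)), 15, seed=42)
print(len(sample))  # 15
```

`random.sample` draws without replacement, so no record can be picked twice; fixing the seed makes the draw reproducible if management asks to see it repeated.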

You can cite the use of a random draw as insurance that the sample is as representative of the 1000 as a random sample of 15-20 can be. The other thing you can do is show the percent change, or the actual numerical change, in the confidence limits (of your mean?) as the sample size goes from 15 or 20 to 30, 60, or 120. For 15 samples the t multiplier is 2.13; for 20 it is 2.086; for 60 it is 2.00; and for 120 (which would be more than 10% of your population) it is 1.98. The difference in the multiplier between a sample size of 15 and one of 120 is all of 7%. Unless you have some very stringent requirements, that difference shouldn't be much of a cause for concern.
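These multipliers can be reproduced directly. A sketch assuming SciPy is available; note that with the conventional df = n − 1 the values come out slightly higher than quoted above (e.g. 2.145 rather than 2.13 for n = 15), and that the full interval half-width also shrinks with √n:

```python
from math import sqrt
from scipy import stats

def t_multiplier(n, conf=0.95):
    """Two-sided t critical value for a mean estimated from n observations (df = n - 1)."""
    return stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)

def half_width(n, s, conf=0.95):
    """Confidence-interval half-width for the mean: t * s / sqrt(n)."""
    return t_multiplier(n, conf) * s / sqrt(n)

for n in (15, 20, 60, 120):
    print(n, round(t_multiplier(n), 3))
```

The multiplier comparison isolates the small-sample t penalty; the `half_width` function shows the full effect of n on the interval, which is the number management will actually see.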

#198550

Amit Kumar Ojha
Participant

Thanks, Robert, for the reply. I understand what you are suggesting, but the problem we are facing is that I cannot randomly choose the sample. Fetching the data for any randomly selected record is again difficult. There is a system in which a lot of data is simply entered and is not organized. In order to fetch the data for any sample, many steps are required, such as confirming what each attribute value implies, mapping attributes to scenarios, validating the meaning/nomenclature with users, etc.

Hence I think the second approach you have mentioned is more appropriate here. I will try it and see whether management is convinced.

Thanks for the help!

#198578

Sylvain
Guest

Just my point of view on this: your management could be right to question how robust this is…

What is the purpose of your sampling? What are you trying to capture with it?

If you are trying to catch big variation between the 1000 lots, 15 to 20 could be enough.

But if critical CTQs are based on small-sigma variation…

You might look into Power and Sample Size, a key chapter in Minitab and/or any Six Sigma training.
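To get a rough feel for what a Power and Sample Size calculation gives, here is a normal-approximation sketch for a one-sample test of the mean, using only the Python standard library (Minitab's exact t-based answer will be slightly larger, and the function name and defaults here are illustrative):

```python
from math import ceil
from statistics import NormalDist

def approx_sample_size(delta_over_sigma, alpha=0.05, power=0.80):
    """Approximate n needed to detect a shift of delta_over_sigma
    standard deviations with a two-sided test (normal approximation)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # ~1.960 for alpha = 0.05
    z_beta = z(power)           # ~0.842 for 80% power
    return ceil(((z_alpha + z_beta) / delta_over_sigma) ** 2)

print(approx_sample_size(1.0))   # big (one-sigma) shift -> small n
print(approx_sample_size(0.25))  # small-sigma variation -> far more data
```

This makes Sylvain's point concrete: a sample of 15-20 can detect a one-sigma shift, but detecting a quarter-sigma shift at the same power needs well over a hundred observations.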

Best regards

S.T.

#198579

Chris Seider
Participant

@AmitOjha

I don’t understand “can’t randomly sample” in your comments above. One can always randomly select something.

Even an approach of getting data across a date range can be as basic as “I’ll select one out of every X orders for each day of the week.” If multiple locations are part of the process, I’d gather data across all the locations and be sure to get “X orders for every day of the week” at each.
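The “one every X orders” idea is systematic sampling, which can be sketched as follows (the `orders` list is hypothetical and assumed to be in date order; a random starting offset keeps the start point from biasing the draw):

```python
import random

def systematic_sample(orders, k, seed=None):
    """Take every k-th order, starting at a random offset in [0, k)."""
    start = random.Random(seed).randrange(k)
    return orders[start::k]

# Illustration: 20 orders drawn from 1000, one every 50.
orders = [f"order-{i:04d}" for i in range(1000)]
sample = systematic_sample(orders, k=50, seed=1)
print(len(sample))  # 20
```

One caveat worth noting: systematic sampling behaves like a random sample only if the ordering has no cycle that lines up with k (e.g. k equal to a weekly pattern).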

If it’s paper, it’s really easy to get “random” samples just by pulling so many out of the stack.

Just some thoughts to consider — my two cents.

#198601

Amit Kumar Ojha
Participant

@Chris: You are absolutely right; however, I think I did not make the problem clear enough. Let me try once more:
We have a process wherein data (numerical) is entered into two different systems (I understand this is redundant and needs to be eliminated first) pertaining to a few entities (multinomial), which are named differently in the two systems. Now I need to analyse this data in terms of variation between the two systems, distribution among entities, any patterns, etc. The problem is that if I list the data and try to select randomly, I end up in a situation where, for the selected sample, I can’t fetch the details (owing to nomenclature, inconsistency in data entry, etc.).

Data is huge.
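For the two-system comparison itself, the core step is mapping each entity’s name in one system to its name in the other before comparing values. A minimal sketch with entirely hypothetical data (`system_a`, `system_b`, and `name_map` are illustrative, with `name_map` standing in for the nomenclature validated with users):

```python
# Hypothetical extracts: entity name -> recorded value in each system.
system_a = {"CUST_001": 120.0, "CUST_002": 75.5}
system_b = {"Customer-1": 118.0, "Customer-2": 75.5}

# Nomenclature mapping (system A name -> system B name).
name_map = {"CUST_001": "Customer-1", "CUST_002": "Customer-2"}

# Compare the two systems entity by entity.
diffs = {a: system_a[a] - system_b[b] for a, b in name_map.items()}
for a_name, diff in diffs.items():
    print(a_name, diff)
```

Once the mapping table exists, the same join gives the between-system variation, the per-entity distribution, and a list of entities that fail to match at all.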

@Sylvain: Right now my objective is Exploratory Data Analysis, wherein we need to gauge the problem so as to formulate the Business Case (with quantification).

I hope I have explained the problem well enough for your further valuable inputs.

#198602

Amit Kumar Ojha
Participant

@Chris: Another way of doing it is to first collect all the data that can be easily fetched, which would be around 40-50 out of 1000 (I am taking data only for the current year), then list those 40-50 data points and randomly select 15-20 out of them.

But then it would not be considered random sampling. That is why what I did was to first select some 30-40 data points randomly from the 1000 and then ignore those for which data is not easily available, which gave me 15-20 samples.
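The approach described here — randomly draw an oversized batch, then drop the records whose details can’t be fetched — can be sketched as follows. `is_fetchable` is a hypothetical stand-in for the manual availability check:

```python
import random

def sample_with_availability(population_ids, n_draw, is_fetchable, seed=None):
    """Randomly draw n_draw IDs, then keep only those whose data can
    actually be fetched. Caution: dropping unavailable records can bias
    the sample if availability is related to the quantity being studied."""
    drawn = random.Random(seed).sample(population_ids, n_draw)
    return [i for i in drawn if is_fetchable(i)]

# Illustration: pretend only even-numbered records are fetchable.
ids = list(range(1, 1001))
usable = sample_with_availability(ids, 40, lambda i: i % 2 == 0, seed=7)
print(len(usable))
```

The caveat in the docstring is the point management is likely to press on: the draw itself is random, but the filtering step is not, so it is worth checking whether “hard to fetch” records differ systematically from the rest.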