# Problem with Data Collection


- This topic has 7 replies, 3 voices, and was last updated 4 years, 5 months ago by Robert Butler.

July 13, 2015 at 4:09 am #55075

Amit Kumar Ojha (Participant, @AmitOjha)

Hi All,

I have a situation where I need some suggestions.

We have a process in which the population size is more than 1000. However, data collection is extremely tedious and costly, so we have no option but to go ahead with a sample size of 15-20.

I am using the appropriate statistical test (since the sample size is less than 30, a t-test for comparing means, etc.), yet top management will probably question the validity of inferences drawn from such a small sample. Can anyone suggest how to justify the results obtained, or offer some other alternative?

July 13, 2015 at 5:29 am #198549

Robert Butler (Participant, @rbutler)

When you say you have a population size of more than 1000 and you can only justify a sample of 15-20 because of time/money constraints, you give the impression that the samples already exist and that it is just a matter of choosing 15 or 20 for analysis. If this is the case, then you need to use a random number generator to randomly assign the numbers 1-1000 to your data set, sort on the randomized number, and take those numbered 1-15 or 1-20.
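A minimal sketch of the random draw described above, assuming the 1000 items can at least be listed by some identifier. The function name `random_draw`, the `record-0001` style identifiers, and the seed value are all hypothetical and only there for illustration.

```python
import random

def random_draw(records, sample_size, seed=None):
    """Label records with a random permutation of 1..N, sort on the label,
    and keep the records labelled 1..sample_size."""
    rng = random.Random(seed)  # optional seed makes the draw reproducible
    labels = list(range(1, len(records) + 1))
    rng.shuffle(labels)  # random assignment of the numbers 1..N
    labelled = sorted(zip(labels, records), key=lambda pair: pair[0])
    return [record for _, record in labelled[:sample_size]]

# Hypothetical identifiers standing in for the 1000 items
population = [f"record-{i:04d}" for i in range(1, 1001)]
sample = random_draw(population, 20, seed=2015)
print(sample)
```

Recording the seed alongside the results lets anyone repeat the draw and confirm that the items were not hand-picked.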

You can cite the use of a random draw as insurance that the sample is as representative of the 1000 as a random sample of 15-20 can be. The other thing you can do is show the percent change or actual numerical change in the confidence limits (of your mean?) as the sample size goes from 15 or 20 to 30, 60, 120. For 15 samples the t multiplier is 2.13, for 20 it is 2.086, for 60 it is 2 and for 120 (which would be more than 10% of your population) it is 1.98. The difference in that multiplier between a sample size of 15 and one of 120 is all of 7%. Unless you have some very stringent requirements, that difference shouldn’t be much of a cause for concern.
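The multipliers quoted above can be checked with a short script. This is a sketch using only the Python standard library: it integrates the Student's t density numerically and finds the two-sided 95% multiplier by bisection, assuming n − 1 degrees of freedom, so the values it prints may differ slightly in the second decimal from the rounded figures in the post. It also prints the confidence-limit half-width in units of the sample standard deviation, t/√n, because the half-width of the limits around a mean depends on the 1/√n factor as well as on the multiplier. The function names are illustrative, not from any library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF for x > 0, using Simpson's rule on [0, x] (steps is even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_multiplier(n, confidence=0.95):
    """Two-sided t multiplier for a sample of size n (df = n - 1)."""
    df = n - 1
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):  # bisection on the CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def half_width_in_sd_units(n, confidence=0.95):
    """Half-width of the confidence limits for a mean, divided by s."""
    return t_multiplier(n, confidence) / math.sqrt(n)

for n in (15, 20, 30, 60, 120):
    print(n, round(t_multiplier(n), 3), round(half_width_in_sd_units(n), 3))
```

Showing management both columns makes it clear how much of any narrowing comes from the multiplier and how much from the sample size itself.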

July 13, 2015 at 5:52 am #198550

Amit Kumar Ojha (Participant, @AmitOjha)

Thanks, Robert, for the reply. I understand what you are suggesting. But the problem we are facing is that I cannot choose the sample randomly: fetching the data for any randomly selected record is itself difficult. We have a system into which a lot of data has simply been entered without being organized. Fetching the data for any sample requires many steps, such as confirming what the attribute value implies, mapping attributes to scenarios, and validating the meaning/nomenclature with users.

Hence I think the second approach you mentioned is more appropriate here. I will try it and see whether management is convinced.

Thanks for the help!!!!

July 20, 2015 at 8:45 am #198578

Just my point of view on this: your management could be right to question how robust this could be…

What is the purpose of your sampling? What are you trying to catch with it?

If you are trying to catch big variations between 1000 lots, 15 to 20 could be enough.

But if critical CTQs depend on small sigma variations…

You might look into Power and Sample Size, a key topic in Minitab and in any Six Sigma training.

Best regards

S.T.
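To make the point about small versus large variations concrete, here is a rough sketch of a sample size calculation. It uses the textbook normal-approximation formula for a two-sided, one-sample test of a mean, n = ((z_{1-α/2} + z_{power}) · σ / δ)², which is an assumption chosen for simplicity. Dedicated power and sample size tools typically work with the t distribution and may return slightly larger values. The function name and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def approx_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Approximate n needed to detect a shift of size delta in a mean,
    given standard deviation sigma (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = z.inv_cdf(power)          # desired power
    return math.ceil(((z_alpha + z_power) * sigma / delta) ** 2)

# Shift expressed in units of sigma: large shifts need few samples,
# small shifts need many.
for delta in (1.0, 0.5, 0.25):
    print(delta, approx_sample_size(sigma=1.0, delta=delta))
```

Halving the shift to be detected roughly quadruples the required sample size, which is why 15 to 20 samples may be adequate for large differences but not for small ones.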

July 20, 2015 at 9:08 am #198579

Chris Seider (Participant, @cseider)

I don’t understand “can’t randomly sample” in your string of comments above. One can always randomly select something.

Even an approach of getting data across a date range can be as basic as “I’ll select one every X orders for each day of the week”. If multiple locations are part of the process, I’d gather data across all the locations and be sure to get “X orders for every day of the week”.

If it’s paper, it’s really easy to get “random” samples by pulling so many out of the stack.

Just some thoughts to consider from my two cents.
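The “one every X orders for each day of the week” idea from this post can be sketched as follows. The order log, its identifiers, and the function name are hypothetical. The sketch assumes each order carries a date and that orders are listed in the sequence they occurred.

```python
from collections import defaultdict
from datetime import date, timedelta

def every_kth_per_weekday(orders, k):
    """Group orders by weekday (0 = Monday ... 6 = Sunday) and keep
    every k-th order within each group."""
    by_weekday = defaultdict(list)
    for order_id, order_date in orders:
        by_weekday[order_date.weekday()].append(order_id)
    return {day: ids[k - 1::k] for day, ids in sorted(by_weekday.items())}

# Hypothetical log: 5 orders a day for 14 consecutive days
start = date(2015, 7, 6)
orders = [(f"order-{d:02d}-{i}", start + timedelta(days=d))
          for d in range(14) for i in range(5)]

picked = every_kth_per_weekday(orders, k=5)
for day, ids in picked.items():
    print(day, ids)
```

If several locations are involved, the same grouping can be keyed on location and weekday together so that every combination is represented.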

July 22, 2015 at 11:25 pm #198601

Amit Kumar Ojha (Participant, @AmitOjha)

@Chris: You are absolutely right; however, I think I did not make the problem clear enough. Let me try once more:

We have a process in which numerical data is entered into two different systems (I understand this is redundant and should be eliminated first). The data pertains to a few entities (multinomial) that are named differently in the two systems. I need to analyse this data in terms of variation between the two systems, distribution among entities, any patterns, etc. The problem is that if I list the data and try to select randomly, I end up with selected samples for which I can’t fetch the details (owing to nomenclature, inconsistency in data entry, etc.). The data set is huge.

@Sylvain: Right now my objective is exploratory data analysis, so that we can gauge the problem and formulate the business case (with quantification).

I hope I have explained the problem well enough for your further valuable input.

July 22, 2015 at 11:35 pm #198602

Amit Kumar Ojha (Participant, @AmitOjha)

@chris Another way of doing it is to first collect all the data that can be fetched easily, which would be around 40-50 records out of 1000 (I am taking data only for the current year), then list those 40-50 data points and randomly select 15-20 of them.

But that would not be considered random sampling. That’s why I instead first selected some 30-40 data points randomly from the 1000 and then ignored those for which data was not easily available, which gave me 15-20 samples.

Awaiting your response!!!
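The oversample-then-filter procedure described in this post can be sketched as below. The `is_retrievable` check is a hypothetical stand-in for the manual steps needed to fetch a record, and the rule used in the example is purely illustrative. The sketch also returns the share of drawn records that could be used, since that figure is worth reporting. If retrievability is related to the values being measured, the surviving records are no longer a simple random sample of the 1000.

```python
import random

def oversample_then_filter(record_ids, draw_size, is_retrievable, seed=None):
    """Draw draw_size records at random, keep only the retrievable ones,
    and report what fraction of the draw survived."""
    rng = random.Random(seed)
    drawn = rng.sample(list(record_ids), draw_size)
    usable = [rid for rid in drawn if is_retrievable(rid)]
    return usable, len(usable) / draw_size

# Illustrative rule only: pretend even-numbered records are retrievable
usable, retrieval_rate = oversample_then_filter(
    range(1, 1001), 35, lambda rid: rid % 2 == 0, seed=7)
print(len(usable), round(retrieval_rate, 2))
```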

July 23, 2015 at 5:35 am #198604

Robert Butler (Participant, @rbutler)

Given the clarification of your situation in the most recent post, I would recommend dropping all pretense of random sampling and just taking the data with the best records (most complete data entry) and running the analysis on that set. In the literature a sample like that is called a convenience sample and, given your inability to acquire a random sample, it is the best you can do. I would recommend taking all of the data you can get your hands on, running the analysis, seeing what you see, writing up your findings, and detailing in the report the issues surrounding sample gathering. You can also look up “convenience sample” on Google and find any number of discussions about the need for and limitations of such sampling methods.
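One way to act on the “best records” advice is to score each record by how completely it was entered and keep those above a threshold, while counting what was left out so the report can describe the selection. This is a sketch only: the field names (`entity`, `system_a`, `system_b`) and the example records are invented for illustration and do not come from the thread.

```python
def completeness(record, required_fields):
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for field in required_fields
                 if record.get(field) not in (None, ""))
    return filled / len(required_fields)

def convenience_sample(records, required_fields, min_completeness=1.0):
    """Keep records meeting the completeness threshold and summarise
    how many were kept and excluded."""
    kept = [r for r in records
            if completeness(r, required_fields) >= min_completeness]
    summary = {"available": len(records),
               "kept": len(kept),
               "excluded": len(records) - len(kept)}
    return kept, summary

# Hypothetical records with invented field names
fields = ["entity", "system_a", "system_b"]
records = [
    {"entity": "E1", "system_a": 10.2, "system_b": 10.4},
    {"entity": "E2", "system_a": 7.9, "system_b": None},
    {"entity": "", "system_a": 5.1, "system_b": 5.0},
]
kept, summary = convenience_sample(records, fields)
print(summary)
```

Including the summary counts in the write-up documents how the convenience sample was formed.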
