Basic Sampling Strategies: Sample vs. Population Data

Information is rarely available at a bargain price. Gathering it is costly in terms of salaries, expenses and time. Because it is often impractical to collect all the data, taking samples can help ease these costs. Sound conclusions can often be drawn from a relatively small amount of data, which makes sampling an efficient way to collect data. Using a sample to draw conclusions about a population is known as statistical inference, and making inferences is a fundamental aspect of statistical thinking.

Selecting the Most Appropriate Sampling Strategy

There are four primary sampling strategies:

Random sampling
Stratified random sampling
Systematic sampling
Rational subgrouping

Before determining which strategy will work best, the analyst must determine what type of study is being conducted. There are normally two types of studies: population and process. With a population study, the analyst is interested in estimating or describing some characteristic of the population (inferential statistics).

With a process study, the analyst is interested in predicting a process characteristic or change over time. It is important to make the distinction for proper selection of a sampling strategy. The “I Love Lucy” television show’s “Candy Factory” episode can be used to illustrate the difference. For example, a population study, using samples, would seek to determine the average weight of the entire daily run of candies. A process study would seek to know whether the weight was changing over the day.
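The distinction can be sketched in a few lines of Python. This is only an illustration: the candy weights, the drift of 0.1 gram per hour and the sample sizes are all invented for the example and do not come from any real process.

```python
import random
from statistics import mean

rng = random.Random(0)

# Hypothetical data: 8 hours of production, 100 candies per hour.
# The upward drift of 0.1 g per hour is an assumption made for illustration.
weights_by_hour = [
    [rng.gauss(10.0 + 0.1 * hour, 0.2) for _ in range(100)]
    for hour in range(8)
]

# Population study: one random sample from the whole day's run
# is used to estimate the average weight of the daily run.
all_weights = [w for hour in weights_by_hour for w in hour]
daily_estimate = mean(rng.sample(all_weights, 40))

# Process study: samples are kept in time order (here, the first five
# candies of each hour) to see whether the weight changes during the day.
hourly_means = [mean(hour[:5]) for hour in weights_by_hour]

print(f"Estimated daily average weight: {daily_estimate:.2f} g")
print("Hourly sample means:", [round(m, 2) for m in hourly_means])
```

The population estimate collapses the day into a single number, while the time-ordered hourly means are what would reveal a drift.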

Random Sampling

Random samples are used in population sampling situations when reviewing historical or batch data. The key to random sampling is that each unit in the population has an equal probability of being selected in the sample. Using random sampling protects against bias being introduced in the sampling process, and hence, it helps in obtaining a representative sample.

In general, random samples are taken by assigning a number to each unit in the population and using a random number table or statistical software such as Minitab to generate the sample list. When the factors by which a population should be stratified are not known, a random sample is a useful first step in obtaining samples.

For example, an improvement team in a human resources department wanted an accurate estimate of what proportion of employees had completed a personal development plan and reviewed it with their managers. The team used its database to obtain a list of all associates. Each associate on the list was assigned a number. Statistical software was used to generate a list of numbers to be sampled, and an estimate was made from the sample.
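A minimal Python sketch of this procedure follows, using the standard library in place of a random number table. The associate list, the completion flags and the sample size of 50 are hypothetical and stand in for the team's real database.

```python
import random

def simple_random_sample(population_ids, sample_size, seed=None):
    """Draw sample_size units without replacement; every unit is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(population_ids), sample_size)

# Hypothetical database: 500 associates numbered 1 to 500.
# True means the development plan was completed and reviewed.
data_rng = random.Random(42)
plan_completed = {emp_id: data_rng.random() < 0.6 for emp_id in range(1, 501)}

# Generate the list of numbers to be sampled, then estimate the proportion.
sampled_ids = simple_random_sample(plan_completed.keys(), sample_size=50, seed=7)
estimated_proportion = sum(plan_completed[i] for i in sampled_ids) / len(sampled_ids)

print(f"Estimated proportion with a completed plan: {estimated_proportion:.2f}")
```

Fixing the seed makes the sample list reproducible, which can be useful when the selection has to be audited later.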

Stratified Random Sampling

Like random samples, stratified random samples are used in population sampling situations when reviewing historical or batch data. Stratified random sampling is used when the population has different groups (strata) and the analyst needs to ensure that those groups are fairly represented in the sample. In stratified random sampling, independent samples are drawn from each group. The size of each sample is proportional to the relative size of the group.

For example, the manager of a lending business wanted to estimate the average cycle time for a loan application process. She knew there were three types (strata) of loans (large, medium and small) and wanted the sample to have the same proportion of large, medium and small loans as the population. She first separated the loan population data into three groups and then pulled a random sample from each group.
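The loan example can be sketched in Python as follows. The function name, the loan identifiers and the stratum sizes (100 large, 300 medium, 600 small) are assumptions made for illustration only.

```python
import random

def proportional_stratified_sample(strata, total_sample_size, seed=None):
    """Draw an independent random sample from each stratum.

    strata maps a stratum name to the list of its units. Each sample is
    sized in proportion to the stratum's share of the population.
    """
    rng = random.Random(seed)
    population_size = sum(len(units) for units in strata.values())
    samples = {}
    for name, units in strata.items():
        # Rounding means the sizes may not always sum exactly to total_sample_size.
        n = round(total_sample_size * len(units) / population_size)
        samples[name] = rng.sample(units, n)
    return samples

# Hypothetical loan population, already separated into three groups.
loans = {
    "large": [f"L{i}" for i in range(100)],
    "medium": [f"M{i}" for i in range(300)],
    "small": [f"S{i}" for i in range(600)],
}

sample = proportional_stratified_sample(loans, total_sample_size=50, seed=1)
sizes = {name: len(units) for name, units in sample.items()}
print(sizes)  # {'large': 5, 'medium': 15, 'small': 30}
```

With these made-up counts, the sample keeps the 10/30/60 percent mix of the population.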

Systematic Sampling

Systematic sampling is typically used in process sampling situations when data is collected in real time during process operation. Unlike population sampling, it requires the analyst to select a sampling frequency. It can also be used for a population study if care is taken that the chosen frequency does not introduce bias.

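Systematic sampling involves taking samples according to some systematic rule, such as every fourth unit or the first five units every hour. One danger of using systematic sampling is that the systematic rule may match some underlying structure and bias the sample. For instance, sampling every fourth unit from a process that cycles through four machines would always measure the output of the same machine.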

For example, the manager of a billing center is using systematic sampling to monitor processing rates. At random times around each hour, five consecutive bills are selected and the processing time is measured.
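A simple "every k-th unit" rule can be sketched in Python as shown here. The function name and the bill numbers are hypothetical; the random starting point is one common way to reduce the risk that the rule lines up with a pattern in the process, not something prescribed by this article.

```python
import random

def systematic_sample(units, interval, start=None, seed=None):
    """Select every interval-th unit, beginning at index start.

    If start is omitted, a random starting index below interval is used.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    if start is None:
        start = random.Random(seed).randrange(interval)
    return units[start::interval]

# Hypothetical bill numbers, listed in processing order.
bills = list(range(1, 41))

every_fourth = systematic_sample(bills, interval=4, start=3)
print(every_fourth)  # [4, 8, 12, 16, 20, 24, 28, 32, 36, 40]
```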

Rational Subgrouping

Rational subgrouping is the process of putting measurements into meaningful groups to better understand the important sources of variation. It is typically used in process sampling situations when data is collected in real time during process operation. Measurements produced under similar conditions are grouped together, so the variation within a subgroup reflects what is sometimes called short-term variation. Grouping the data this way helps in understanding the sources of variation between subgroups, sometimes called long-term variation.

The goal should be to minimize the chance of special causes of variation within a subgroup and maximize the chance of special causes between subgroups. Subgrouping over time is the most common approach, but subgrouping can also be done by other suspected sources of variation (e.g., location, customer or supplier).

For example, an equipment leasing business was trying to improve equipment turnaround time. The team selected five samples per day from each of three processing centers. The measurements from each processing center formed a subgroup.

When using subgrouping, form subgroups with items produced under similar conditions. To ensure items in a subgroup were produced under similar conditions, select items produced close together in time.
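One way to see what subgrouping reveals is to compare the variation within subgroups to the variation between subgroup averages. The Python sketch below does this for three hypothetical processing centers; the turnaround times are invented, and averaging the subgroup variances is a simplification that assumes equal subgroup sizes.

```python
from statistics import mean, variance

def subgroup_summary(subgroups):
    """Return (within-subgroup variance, variance of the subgroup means).

    subgroups maps a subgroup name to its list of measurements.
    The within figure averages the sample variances, which assumes
    all subgroups have the same size.
    """
    within = mean(variance(values) for values in subgroups.values())
    between = variance([mean(values) for values in subgroups.values()])
    return within, between

# Hypothetical turnaround times in days: five samples per center.
turnaround = {
    "center_a": [4.1, 3.9, 4.0, 4.2, 3.8],
    "center_b": [5.0, 5.2, 4.9, 5.1, 4.8],
    "center_c": [4.0, 4.1, 3.9, 4.0, 4.0],
}

within, between = subgroup_summary(turnaround)
print(f"Within-subgroup variance:  {within:.4f}")
print(f"Between-subgroup variance: {between:.4f}")
```

In this made-up data the between-subgroup figure is much larger than the within-subgroup figure, which would point to a difference among centers rather than to noise inside any one center.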

Determining Sample Size

This article focused on basic sampling strategies. An analyst must determine which strategy applies to a particular situation before determining how much data is required for the sample. The amount of sample data needed depends on the question the analyst wants to answer. The analyst should collect enough baseline data to capture an entire iteration (or cycle) of the process.

An iteration should account for the different types of variation seen within the process, such as cycles, shifts, seasons, trends, product types, volume ranges, cycle time ranges, demographic mixes, etc. If historical data is not available, a data collection plan should be instituted to collect the appropriate data.

Factors affecting sample size include:

How confident the analyst wants to be that the sample will provide a good estimate of the true population mean (confidence level). The more confidence required, the greater the sample size needed.
How close (precision) to the “truth” the analyst wants to be. For more precision, a greater sample size is needed.
How much variation exists or is estimated to exist in the population. If there is more variation, a greater sample size is needed.
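The three factors above can be tied together with a common textbook formula for estimating a mean, n = (z × s / E)², where z comes from the confidence level, s is the estimated standard deviation and E is the acceptable margin of error. This formula is not given in the article; it is offered here as a sketch that assumes a simple random sample and a roughly normal sampling distribution, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(confidence, margin_of_error, std_dev):
    """Approximate sample size needed to estimate a population mean.

    confidence: desired confidence level, e.g. 0.95
    margin_of_error: how close to the true mean the estimate should be
    std_dev: estimated standard deviation of the population
    """
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    if margin_of_error <= 0 or std_dev <= 0:
        raise ValueError("margin_of_error and std_dev must be positive")
    # Two-sided z value for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * std_dev / margin_of_error) ** 2)

# Hypothetical inputs: 95% confidence, within 1 unit, standard deviation of 5.
print(sample_size_for_mean(0.95, 1.0, 5.0))  # 97
```

Raising the confidence level, tightening the margin of error or increasing the standard deviation each increases the result, matching the three factors listed.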

Sample size calculators are available to make determining sample size much easier. It is best, however, that an analyst consult with a Master Black Belt and/or Black Belt coach until he or she is comfortable with determining sample size.