The proliferation of do-it-yourself statistical software is giving some Six Sigma practitioners and other quality professionals who are not strong in statistics a false sense of confidence in their ability to collect data, analyze it and reach sensible conclusions. What some may not realize is that much of the critical work is done long before any data is collected, let alone analyzed.

Before the first research question is formulated, researchers must clarify the aims of their study. Until that is done, there is little chance that a proper format for analysis can be constructed. Before firing up the statistical software, quality professionals should consider eight important points that help produce the best possible data, which, in turn, should allow the most accurate analysis. The examples used here are drawn from the pharmaceutical industry.

Point 1 – Study’s Objective

The most important stage of the pre-analysis process is understanding the aims of the study. For instance, a pharmaceutical company would like to know if Drug A is better than Drug B. This seems a simple and clear-cut question. But is it? One drug can be better than the other in many different ways:

  • Drug A may cure a larger percentage of people than Drug B.
  • Drugs A and B may cure the same percentage, but Drug A does it in half the time.
  • Drug A may have fewer side effects than Drug B.
  • Drugs A and B may be bio-equivalent, but Drug A costs half what Drug B costs.

Each of these study aims calls for a different research formulation.

Point 2 – Study Design

The next point is clarifying the study design. Is there a set study hypothesis (e.g., Drug A works faster than Drug B)? Is it a preliminary exploratory setting (e.g., workers in a firm are experiencing health problems, but the cause is as yet undetermined)? Is it an observational study in which the researcher has no control over the parameters? Or is it a strict, audited clinical design?

Point 3 – Target Population

Defining the target population is next. In the drug example, this is the group of people who are candidates for using the drugs. It is crucially important to define all the different subgroups within this population. For instance, will this drug be prescribed to pregnant women (remember thalidomide)?

Point 4 – Sampling Population

The population that will be used for sampling purposes ideally should be the same as the target population, but in many cases a sub-population is used for sampling. For instance, many studies are run on college students as surrogates for young adults. College students, however, generally have a higher socio-economic status and health score than the population at large. These differences can bias study results and must be taken into account. Of course, it is also essential to select a sample size large enough to reach the researchers’ aims, yet no larger than is economical.
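The sample size question can be made concrete before any data is collected. The following is a minimal sketch, assuming the study compares cure rates between two drugs and that a normal-approximation formula for two independent proportions is acceptable for planning. The function name, the cure rates and the default significance level and power are illustrative assumptions, not values from any actual study.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect p_a versus p_b.

    Uses the normal-approximation formula for a two-sided test of two
    independent proportions. All inputs are planning assumptions.
    """
    if p_a == p_b:
        raise ValueError("The two cure rates must differ.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    n = (z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2
    return math.ceil(n)


# Hypothetical planning values: Drug B cures 60 percent of patients and
# Drug A is hoped to cure 70 percent.
n_per_group = sample_size_two_proportions(0.70, 0.60)
print(n_per_group)
```

Halving the difference the study must detect roughly quadruples the required sample, which is why the size of a meaningful difference should be agreed on before the budget is set.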

Point 5 – Target Parameters

Next is defining the main and secondary target parameters (i.e., the variables of interest to the researcher). It is important to identify possible confounders (variables that can mask or distort real relationships between the target parameters). This is especially crucial for observational studies.

Researchers also must define what would be considered a clinical difference. A clinical difference is often confused with statistical significance, but in fact the two are entirely different. To continue the drug example, suppose a drug company will produce Drug A only if it can be proven to be 50 percent faster acting than Drug B. A clinical study is done, and it finds that Drug A is only 25 percent faster than Drug B, a difference unlikely to be due to chance alone. Thus, while the result of the study is highly significant statistically, the clinical difference is not good enough to meet the drug company’s requirements.
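The distinction between the two kinds of significance can be sketched as two separate checks on the same result. This is a minimal illustration, assuming "faster" means a shorter mean time to onset and that a large-sample two-sided z-test is adequate. The function name, the means, the standard deviations and the group sizes are made-up values chosen only to mirror the 25 percent versus 50 percent example.

```python
import math
from statistics import NormalDist


def compare_onset_times(mean_a, mean_b, sd_a, sd_b, n_a, n_b,
                        required_speedup=0.50, alpha=0.05):
    """Check statistical and clinical significance separately.

    Times are mean minutes to onset, so lower is faster. The z-test on
    the difference in means is a large-sample approximation.
    """
    std_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_b - mean_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    speedup = (mean_b - mean_a) / mean_b  # fraction by which Drug A is faster
    return {
        "p_value": p_value,
        "speedup": speedup,
        "statistically_significant": p_value < alpha,
        "clinically_sufficient": speedup >= required_speedup,
    }


# Hypothetical summary data: Drug A acts in 45 minutes on average,
# Drug B in 60 minutes, with 200 subjects in each group.
result = compare_onset_times(mean_a=45.0, mean_b=60.0,
                             sd_a=10.0, sd_b=10.0, n_a=200, n_b=200)
print(result["statistically_significant"], result["clinically_sufficient"])
```

With these assumed numbers the first check passes and the second fails, which is the situation described in the drug example.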

Point 6 – Control Group

The sixth point is choosing a proper control group, if needed. A control group is a necessity for cause-and-effect studies. For example, suppose a certain working condition is thought to cause excessive health problems. To test this hypothesis, workers exposed to the suspected hazard must be compared to similar workers who are unexposed.

Point 7 – Randomization

Choosing a proper randomization scheme for selecting and assigning test subjects is important. Good randomization evens out chance differences between groups of test subjects so that they do not bias study results. Bad randomization does the opposite. For example, it was suspected that a certain drug had a dampening side effect on subjects’ cognitive skills. In order to test this premise, the researcher randomly selected two freshman classes from an academic institution. One class was given the drug, the other a placebo. The researcher wasn’t aware that the drug group had mastered a skill not yet undertaken by the placebo group. This chance difference invalidated the experiment’s findings.
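One way to avoid the problem in the classroom example is to randomize individual subjects rather than whole classes, so that any difference between the classes is spread across both groups. The sketch below assumes a simple two-arm design with equal group sizes. The function name, the subject identifiers and the seed are hypothetical.

```python
import random


def randomize_subjects(subject_ids, seed=None):
    """Shuffle the subjects and split them into two equal-sized groups.

    A fixed seed makes the allocation reproducible for an audit.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical roster: 20 students from each of two freshman classes.
students = [f"class{c}-student{i:02d}" for c in ("A", "B") for i in range(1, 21)]
drug_group, placebo_group = randomize_subjects(students, seed=2024)
print(len(drug_group), len(placebo_group))
```

Because each student is assigned independently of class membership, a skill mastered by one class would be expected to appear in both the drug group and the placebo group.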

Point 8 – Questionnaire

Questionnaire planning is another topic that should be explored thoroughly. The best-planned study can fail due to poorly collected data. One physician collected annual data on workers’ health. The EKG diagnosis was coded in “free text.” The range of values included “OK,” “NEG,” “N.F.,” “NORM,” “NOR” and others. When the findings were positive, the coding was even more inventive. As a result, it was impossible to statistically analyze the EKG findings.
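The free-text problem can be prevented by accepting only a fixed set of codes at data entry. The following is a minimal sketch of that idea. The three allowed codes and the function name are hypothetical, and treating the spellings quoted above as meaning a normal finding is an assumption made for illustration, not something the physician's records confirm.

```python
# Hypothetical controlled vocabulary for the EKG field.
ALLOWED_EKG_CODES = {"NORMAL", "ABNORMAL", "NOT_DONE"}

# Assumption: these free-text spellings were all meant as a normal finding.
LEGACY_NORMAL_SPELLINGS = {"OK", "NEG", "N.F.", "NORM", "NOR"}


def code_ekg_entry(raw):
    """Return a controlled code for an entry, or None if it needs review."""
    cleaned = raw.strip().upper()
    if cleaned in ALLOWED_EKG_CODES:
        return cleaned
    if cleaned in LEGACY_NORMAL_SPELLINGS:
        return "NORMAL"
    return None  # unrecognized text is flagged rather than guessed at


entries = ["OK", " norm ", "N.F.", "abnormal", "see chart"]
coded = [code_ekg_entry(entry) for entry in entries]
print(coded)
```

Entries that come back as None would go to a person for review. Silently guessing their meaning would reintroduce the ambiguity the coding scheme is meant to remove.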

It is prudent, when possible, to leave certain types of data-gathering to professionals. Even something like a computerized database against which new data is to be measured needs to be evaluated. In one case, a perceived change in workers’ cholesterol levels arose because one laboratory was used when the database was compiled and another for the newer data. There was no actual change in the workers’ health.
